TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/PromptCap: Prompt-Guided Task-Aware Image Captioning

PromptCap: Prompt-Guided Task-Aware Image Captioning

Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, Jiebo Luo

2022-11-15Question AnsweringImage CaptioningWorld KnowledgeRetrievalVisual Question Answering (VQA)Language ModellingVisual Question Answering
PaperPDFCode(official)

Abstract

Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to yield the correct answer. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their strong knowledge retrieval and reasoning capabilities. To enable LM to understand images, prior work uses a captioning model to convert images into text. However, when summarizing an image in a single caption sentence, which visual entities to describe are often underspecified. Generic image captions often miss visual details essential for the LM to answer visual questions correctly. To address this challenge, we propose PromptCap (Prompt-guided image Captioning), a captioning model designed to serve as a better connector between images and black-box LMs. Different from generic captions, PromptCap takes a natural-language prompt to control the visual entities to describe in the generated caption. The prompt contains a question that the caption should aid in answering. To avoid extra annotation, PromptCap is trained by examples synthesized with GPT-3 and existing datasets. We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA. PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA). Zero-shot results on WebQA show that PromptCap generalizes well to unseen domains.

Results

TaskDatasetMetricValueModel
Visual Question Answering (VQA)A-OKVQADA VQA Score59.6PromptCap
Visual Question Answering (VQA)A-OKVQAMC Accuracy73.2PromptCap
Visual Question Answering (VQA)OK-VQAAccuracy60.4PromptCap
Visual Question Answering (VQA)TextVQA test-standardoverall51.8PromptCap
Visual Question AnsweringTextVQA test-standardoverall51.8PromptCap

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment2025-07-21From Roots to Rewards: Dynamic Tree Reasoning with RL2025-07-17Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering2025-07-17Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It2025-07-17City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning2025-07-17HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation2025-07-17Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes2025-07-17HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals2025-07-17