
Retrieval Augmented Visual Question Answering with Outside Knowledge

Weizhe Lin, Bill Byrne

2022-10-07
Question Answering, Passage Retrieval, Diagnostic, Retrieval, Visual Question Answering (VQA), Answer Generation, Visual Question Answering

Abstract

Outside-Knowledge Visual Question Answering (OK-VQA) is a challenging VQA task that requires retrieval of external knowledge to answer questions about images. Recent OK-VQA systems use Dense Passage Retrieval (DPR) to retrieve documents from external knowledge bases, such as Wikipedia, but with DPR trained separately from answer generation, introducing a potential limit on the overall system performance. Instead, we propose a joint training scheme which includes differentiable DPR integrated with answer generation so that the system can be trained in an end-to-end fashion. Our experiments show that our scheme outperforms recent OK-VQA systems with strong DPR for retrieval. We also introduce new diagnostic metrics to analyze how retrieval and generation interact. The strong retrieval ability of our model significantly reduces the number of retrieved documents needed in training, yielding significant benefits in answer quality and computation required for training.
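The retrieval step mentioned in the abstract can be illustrated with a minimal sketch, assuming the usual dense-retrieval setup in which queries and documents are embedded as vectors and ranked by inner product. This is not the authors' implementation: the function names (`retrieve_top_k`, `softmax`) and the toy embeddings are hypothetical, and no training is performed. The softmax over the retrieved scores is shown only because normalized scores of this kind are one way a retriever's output could enter a joint training loss.

```python
import math

def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(query_vec, doc_vecs, k):
    """Return (doc_index, score) pairs for the k highest inner-product scores."""
    scored = [(i, inner_product(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def softmax(scores):
    """Normalize scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings, made up for illustration only.
query = [1.0, 0.0, 1.0]
docs = [
    [0.9, 0.1, 0.8],  # doc 0: close to the query
    [0.0, 1.0, 0.0],  # doc 1: orthogonal to the query
    [0.5, 0.5, 0.5],  # doc 2: partially aligned
]

top = retrieve_top_k(query, docs, k=2)
weights = softmax([score for _, score in top])
print(top)
print(weights)
```

In this toy setup the two retrieved documents are doc 0 and doc 2, with doc 0 receiving the larger weight.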

Results

Task                             Dataset  Metric            Value  Model
Visual Question Answering (VQA)  OK-VQA   Accuracy          54.48  RA-VQA (T5-large)
Visual Question Answering (VQA)  OK-VQA   Exact Match (EM)  59.41  RA-VQA (T5-large)
Visual Question Answering (VQA)  OK-VQA   Recall@5          82.84  RA-VQA (T5-large)
Visual Question Answering (VQA)  OK-VQA   Accuracy          51.22  RA-VQA-FrDPR (T5-large)
Visual Question Answering (VQA)  OK-VQA   Exact Match (EM)  55.77  RA-VQA-FrDPR (T5-large)
Visual Question Answering (VQA)  OK-VQA   Recall@5          81.25  RA-VQA-FrDPR (T5-large)
Retrieval                        OK-VQA   Recall@5          82.84  RA-VQA
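
For readers unfamiliar with the three metrics in the table, the following is a simplified sketch of how per-question scores of this kind could be computed. It rests on several assumptions that are not taken from this page: answers are compared after lowercasing and trimming only, accuracy uses the simplified VQA rule min(matches / 3, 1), and a retrieved document counts as relevant when it contains a reference answer as a substring. The official evaluation scripts apply further normalization and may differ in detail; all function names here are hypothetical.

```python
def normalize(text):
    """Minimal normalization (assumption): lowercase and strip whitespace."""
    return text.strip().lower()

def exact_match(prediction, reference_answers):
    """True if the prediction equals any reference answer."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in reference_answers)

def vqa_accuracy(prediction, reference_answers):
    """Simplified VQA score: full credit once 3 annotators gave this answer."""
    pred = normalize(prediction)
    matches = sum(1 for ref in reference_answers if normalize(ref) == pred)
    return min(matches / 3.0, 1.0)

def recall_at_k(retrieved_docs, reference_answers, k):
    """True if any of the top-k documents contains a reference answer
    (substring matching is an assumption of this sketch)."""
    top = [normalize(doc) for doc in retrieved_docs[:k]]
    refs = [normalize(ref) for ref in reference_answers]
    return any(ref in doc for doc in top for ref in refs)

# Toy example with made-up annotations and documents.
refs = ["surfing", "surfing", "surf", "surfing", "wakeboarding"]
retrieved = [
    "A beach is a landform alongside a body of water.",
    "Surfing is a surface water sport.",
]

print(exact_match("surf", refs))
print(vqa_accuracy("surf", refs))
print(vqa_accuracy("surfing", refs))
print(recall_at_k(retrieved, refs, k=1))
print(recall_at_k(retrieved, refs, k=2))
```

Averaging these per-question values over a test set and multiplying by 100 would give percentages on the same scale as the table, though the numbers above come from the paper and not from this sketch.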

Related Papers

Smart fault detection in satellite electrical power system (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Demographic-aware fine-grained classification of pediatric wrist fractures (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)