Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VQA: Visual Question Answering

Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh

Published 3 May 2015 · ICCV 2015
Tasks: Image Captioning · Visual Question Answering (VQA) · Multiple-Choice Visual Question Answering
Paper · PDF · Code

Abstract

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
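The abstract notes that open-ended VQA is amenable to automatic evaluation because most answers are short. The paper's consensus-based accuracy counts an answer as fully correct when at least 3 of the 10 human annotators gave it. A minimal sketch of that rule (the official evaluation additionally averages over all 10-choose-9 annotator subsets, which this simplified version omits):

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA consensus accuracy: min(#matching humans / 3, 1).

    `predicted` is the model's answer string; `human_answers` is the
    list of (typically 10) human-provided answers for the question.
    """
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)


# A prediction matching 3+ of 10 humans scores 1.0; matching only
# one human scores 1/3; matching none scores 0.0.
print(vqa_accuracy("yes", ["yes"] * 3 + ["no"] * 7))   # 1.0
print(vqa_accuracy("yes", ["yes"] * 1 + ["no"] * 9))   # 0.333...
print(vqa_accuracy("red", ["blue"] * 10))              # 0.0
```

In practice answers are also normalized (lowercasing, stripping articles and punctuation) before matching; that preprocessing is omitted here for brevity.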

Results

Task: Visual Question Answering (VQA) · Metric: Percentage correct

Dataset                                            | Value | Model
COCO VQA abstract 1.0, multiple choice             | 71.18 | Dualnet ensemble
COCO VQA abstract 1.0, multiple choice             | 69.21 | LSTM + global features
COCO VQA abstract 1.0, multiple choice             | 61.41 | LSTM blind
COCO VQA real images 2.0, open ended               | 68.16 | HDU-USYD-UNCC
COCO VQA real images 2.0, open ended               | 68.07 | DLAIT
COCO VQA real images 1.0, multiple choice          | 63.1  | LSTM Q+I
COCO VQA real images 1.0, open ended               | 58.2  | LSTM Q+I
COCO VQA abstract images 1.0, open ended           | 69.73 | Dualnet ensemble
COCO VQA abstract images 1.0, open ended           | 65.02 | LSTM + global features
COCO VQA abstract images 1.0, open ended           | 57.19 | LSTM blind
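Several of the models above ("LSTM Q+I", "LSTM blind") are question-plus-image baselines: a question embedding and an image feature are fused and fed to a classifier over a fixed answer vocabulary. The sketch below illustrates that general fusion pattern with elementwise multiplication; the embedding dimensions, the fusion operator, and the weight shapes are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_scores(q_emb, i_emb, W):
    """Illustrative Q+I-style fusion: L2-normalize each modality,
    combine elementwise, then score a fixed answer vocabulary with
    a linear layer followed by softmax."""
    q = q_emb / np.linalg.norm(q_emb)
    v = i_emb / np.linalg.norm(i_emb)
    fused = q * v                # elementwise fusion of the two modalities
    return softmax(W @ fused)    # distribution over candidate answers

# Toy usage: 1024-d embeddings, 1000-answer vocabulary (sizes assumed).
rng = np.random.default_rng(0)
q = rng.standard_normal(1024)
v = rng.standard_normal(1024)
W = rng.standard_normal((1000, 1024))
p = answer_scores(q, v, W)       # shape (1000,), sums to 1
```

The "blind" baselines in the table follow the same recipe with the image feature dropped, which is why they trail the Q+I variants.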
