Simple Baseline for Visual Question Answering

Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, Rob Fergus

2015-12-07 | Visual Question Answering (VQA)


Abstract

We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question and CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows comparable performance to many recent approaches using recurrent neural networks. To explore the strengths and weaknesses of the trained model, we also provide an interactive web demo and open-source code.
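The concatenation step described in the abstract can be illustrated with a small sketch. This is not the authors' released code: it assumes a plain linear softmax classifier over the concatenated vector, and the vocabulary, the two-dimensional "image features" (a stand-in for real CNN activations), the weights, and the helper names `bag_of_words` and `predict_answer` are all hypothetical values chosen only to make the example run.

```python
import math


def bag_of_words(question, vocab):
    """Count how often each vocabulary word occurs in the question."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = [0.0] * len(vocab)
    for token in question.lower().replace("?", " ").split():
        if token in index:
            counts[index[token]] += 1.0
    return counts


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_answer(question, image_features, vocab, weights, biases, answers):
    """Concatenate question and image features, then score each candidate answer."""
    features = bag_of_words(question, vocab) + list(image_features)
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs


# Toy setup (all values are made up for illustration).
vocab = ["what", "color", "is", "the", "how", "many"]
answers = ["red", "two"]
image_features = [0.5, 0.1]  # placeholder for CNN features of one image
weights = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0],  # row scoring the answer "red"
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.2],  # row scoring the answer "two"
]
biases = [0.0, 0.0]

answer, probs = predict_answer(
    "What color is the ball?", image_features, vocab, weights, biases, answers
)
print(answer, probs)
```

In a trained model the weights would be learned from question, image, and answer triples; here they are hand-set so that the word "color" pushes the prediction toward a color answer.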

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	COCO Visual Question Answering (VQA) real images 1.0 multiple choice	Percentage correct	62	iBOWIMG baseline
Visual Question Answering (VQA)	COCO Visual Question Answering (VQA) real images 1.0 open ended	Percentage correct	55.9	iBOWIMG baseline

Related Papers

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)
Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling (2025-07-08)