TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Simple Baseline for Visual Question Answering

Simple Baseline for Visual Question Answering

Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, Rob Fergus

2015-12-07Visual Question Answering (VQA)Visual Question Answering
PaperPDFCodeCodeCodeCodeCodeCode(official)Code

Abstract

We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question and CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows comparable performance to many recent approaches using recurrent neural networks. To explore the strength and weakness of the trained model, we also provide an interactive web demo and open-source code. .

Results

TaskDatasetMetricValueModel
Visual Question Answering (VQA)COCO Visual Question Answering (VQA) real images 1.0 multiple choicePercentage correct62iBOWIMG baseline
Visual Question Answering (VQA)COCO Visual Question Answering (VQA) real images 1.0 open endedPercentage correct55.9iBOWIMG baseline

Related Papers

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning2025-07-17MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM2025-07-16Describe Anything Model for Visual Question Answering on Text-rich Images2025-07-16Evaluating Attribute Confusion in Fashion Text-to-Image Generation2025-07-09LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation2025-07-09Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights2025-07-09MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning2025-07-09Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling2025-07-08