
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang

Published 2017-07-25 · CVPR 2018
Tasks: Image Captioning, Visual Question Answering (VQA)
Links: Paper · PDF · Code (one official implementation and numerous community implementations)

Abstract

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
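To make the two-stage mechanism in the abstract concrete, here is a minimal sketch (not the authors' code) of top-down attention over bottom-up region features. The layer sizes are illustrative assumptions, e.g. 2048-d region features, k = 36 detector proposals, and a 512-d top-down query such as a question encoding or caption-LSTM state.

```python
# Sketch of top-down attention over bottom-up region features (PyTorch).
# Dimensions are illustrative assumptions, not the paper's exact settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    def __init__(self, region_dim=2048, query_dim=512, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)  # project region features
        self.proj_q = nn.Linear(query_dim, hidden_dim)   # project top-down signal
        self.score = nn.Linear(hidden_dim, 1)            # scalar logit per region

    def forward(self, regions, query):
        # regions: (batch, k, region_dim) -- k bottom-up proposals from a detector
        # query:   (batch, query_dim)     -- top-down signal (question / caption state)
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(query).unsqueeze(1))
        logits = self.score(joint).squeeze(-1)           # (batch, k)
        weights = F.softmax(logits, dim=-1)              # attention over regions
        attended = (weights.unsqueeze(-1) * regions).sum(dim=1)  # weighted sum of features
        return attended, weights

# Usage: attend over 36 region features with a 512-d query.
attn = TopDownAttention()
v = torch.randn(2, 36, 2048)
q = torch.randn(2, 512)
feat, w = attn(v, q)
print(feat.shape, w.shape)  # torch.Size([2, 2048]) torch.Size([2, 36])
```

The key design point the paper argues for is that attention is computed over a variable set of object-level regions (here `regions`) rather than over a uniform grid of CNN activations; the detector supplies the candidates, and the top-down query decides their weights.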

Results

Task                             | Dataset         | Metric       | Value | Model
Visual Question Answering (VQA) | GQA Test2019    | Accuracy     | 49.74 | BottomUp
Visual Question Answering (VQA) | GQA Test2019    | Binary       | 66.64 | BottomUp
Visual Question Answering (VQA) | GQA Test2019    | Consistency  | 78.71 | BottomUp
Visual Question Answering (VQA) | GQA Test2019    | Distribution | 5.98  | BottomUp
Visual Question Answering (VQA) | GQA Test2019    | Open         | 34.83 | BottomUp
Visual Question Answering (VQA) | GQA Test2019    | Plausibility | 84.57 | BottomUp
Visual Question Answering (VQA) | GQA Test2019    | Validity     | 96.18 | BottomUp
Visual Question Answering (VQA) | VQA v2 test-std | overall      | 70.34 | Up-Down

Related Papers

- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
- LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)
- Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights (2025-07-09)
- MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)