Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering

Soravit Changpinyo, Bo Pang, Piyush Sharma, Radu Soricut

2019-09-04 | IJCNLP 2019 11
Tasks: Question Answering, Transfer Learning, Image Captioning, Visual Question Answering (VQA), Object Detection


Abstract

Object detection plays an important role in current solutions to vision and language tasks like image captioning and visual question answering. However, popular models like Faster R-CNN rely on a costly process of annotating ground-truths for both the bounding boxes and their corresponding semantic labels, making them less amenable as a primitive task for transfer learning. In this paper, we examine the effect of decoupling box proposal and featurization for downstream tasks. The key insight is that this allows us to leverage a large amount of labeled annotations that were previously unavailable for standard object detection benchmarks. Empirically, we demonstrate that this leads to effective transfer learning and improved image captioning and visual question answering models, as measured on publicly available benchmarks.
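The decoupling described in the abstract can be illustrated with a minimal sketch, assuming a two-stage pipeline in which a class-agnostic box proposer and a separately trained featurizer are composed only at inference time. Everything here is hypothetical and is not the authors' implementation: `propose_boxes`, `featurize_crop`, and `box_features` are illustrative names, and both stages are toy stand-ins (tiled strips and mean colour) for trained models.

```python
import numpy as np


def propose_boxes(image, num_boxes=4):
    """Hypothetical class-agnostic proposer.

    Stand-in: tiles the image into vertical strips. A real proposer would be
    a detector trained on box annotations only, with no semantic labels.
    Returns an integer array of shape (num_boxes, 4) as (x0, y0, x1, y1).
    """
    h, w = image.shape[:2]
    edges = np.linspace(0, w, num_boxes + 1).astype(int)
    y0 = np.zeros(num_boxes, dtype=int)
    y1 = np.full(num_boxes, h, dtype=int)
    return np.stack([edges[:-1], y0, edges[1:], y1], axis=1)


def featurize_crop(crop, dim=8):
    """Hypothetical featurizer.

    Stand-in: mean colour per channel, repeated to length `dim`. A real
    featurizer would be an image classifier trained on fine-grained labels,
    independently of the box proposer.
    """
    mean = crop.reshape(-1, crop.shape[-1]).mean(axis=0)
    return np.resize(mean, dim)


def box_features(image, proposer, featurizer):
    """Compose the two independent stages: boxes first, then per-box features."""
    boxes = proposer(image)
    feats = np.stack(
        [featurizer(image[y0:y1, x0:x1]) for x0, y0, x1, y1 in boxes]
    )
    return boxes, feats


if __name__ == "__main__":
    image = np.zeros((32, 64, 3))
    boxes, feats = box_features(image, propose_boxes, featurize_crop)
    print(boxes.shape, feats.shape)
```

Because the two stages share only the box coordinates, either one could in principle be swapped or retrained on a different label set without touching the other, which is the property the abstract relies on for transfer learning.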

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	VizWiz 2018	number	28.81	B-Ultra
Visual Question Answering (VQA)	VizWiz 2018	other	35.41	B-Ultra
Visual Question Answering (VQA)	VizWiz 2018	overall	53.68	B-Ultra
Visual Question Answering (VQA)	VizWiz 2018	unanswerable	84.03	B-Ultra
Visual Question Answering (VQA)	VizWiz 2018	yes/no	68.12	B-Ultra
