
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering

Soravit Changpinyo, Bo Pang, Piyush Sharma, Radu Soricut

2019-09-04 | IJCNLP 2019 11 | Question Answering, Transfer Learning, Image Captioning, Visual Question Answering (VQA), object-detection, Object Detection, Visual Question Answering

Abstract

Object detection plays an important role in current solutions to vision and language tasks like image captioning and visual question answering. However, popular models like Faster R-CNN rely on a costly process of annotating ground-truths for both the bounding boxes and their corresponding semantic labels, making it less amenable as a primitive task for transfer learning. In this paper, we examine the effect of decoupling box proposal and featurization for down-stream tasks. The key insight is that this allows us to leverage a large amount of labeled annotations that were previously unavailable for standard object detection benchmarks. Empirically, we demonstrate that this leads to effective transfer learning and improved image captioning and visual question answering models, as measured on publicly available benchmarks.
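To make the decoupling idea concrete, here is a minimal sketch, assuming a toy setup that is not the authors' implementation. The names `propose_boxes`, `featurize_crop`, and `extract_region_features` are hypothetical, and both components are trivial stand-ins: a grid tiler in place of a class-agnostic box proposer, and simple pixel statistics in place of a feature network trained on fine-grained labels. The point is only structural: the two stages are separate callables, so each could in principle be trained on its own data and swapped independently.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels; illustrative convention
Image = List[List[float]]         # toy grayscale image as nested lists


def propose_boxes(image: Image, grid: int = 2) -> List[Box]:
    """Hypothetical stand-in for a class-agnostic box proposer.

    It simply tiles the image into grid x grid regions and never
    looks at semantic labels.
    """
    h, w = len(image), len(image[0])
    return [
        (c * w // grid, r * h // grid, (c + 1) * w // grid, (r + 1) * h // grid)
        for r in range(grid)
        for c in range(grid)
    ]


def featurize_crop(image: Image, box: Box) -> List[float]:
    """Hypothetical stand-in for a separately trained featurizer.

    It returns [mean, max] of the pixels inside the box; a real system
    would run a learned network on the cropped region instead.
    """
    x0, y0, x1, y1 = box
    pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return [sum(pixels) / len(pixels), max(pixels)]


def extract_region_features(
    image: Image,
    proposer: Callable[[Image], List[Box]],
    featurizer: Callable[[Image, Box], List[float]],
) -> List[Tuple[Box, List[float]]]:
    """Compose the two stages: propose boxes first, then featurize each one."""
    return [(box, featurizer(image, box)) for box in proposer(image)]


# Usage: a 4x4 toy image whose pixel value is 4 * row + col.
toy_image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
region_features = extract_region_features(toy_image, propose_boxes, featurize_crop)
```

Because `extract_region_features` only depends on the two callables' signatures, replacing the featurizer leaves the proposer untouched, which is the property the abstract attributes to decoupling.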

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Visual Question Answering (VQA) | VizWiz 2018 | number | 28.81 | B-Ultra
Visual Question Answering (VQA) | VizWiz 2018 | other | 35.41 | B-Ultra
Visual Question Answering (VQA) | VizWiz 2018 | overall | 53.68 | B-Ultra
Visual Question Answering (VQA) | VizWiz 2018 | unanswerable | 84.03 | B-Ultra
Visual Question Answering (VQA) | VizWiz 2018 | yes/no | 68.12 | B-Ultra

Related Papers

RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)