Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, Lijuan Wang

Published: 2022-06-15 · NeurIPS 2022
Tasks: Question Answering · Described Object Detection · Image-text Retrieval · Text Retrieval · Referring Expression Comprehension · Image Captioning · Visual Reasoning · Phrase Grounding · Visual Question Answering (VQA) · Object Detection · Visual Question Answering
Paper · PDF · Code (official)

Abstract

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is available at https://github.com/microsoft/FIBER.
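The abstract's central architectural idea is to push fusion into the backbones: each modality's features are updated with cross-attention over the other modality's features, inside the backbone layers, rather than in separate fusion layers stacked on top. The following is a minimal, single-head numpy sketch of that idea, not the actual FIBER implementation; the function names and the `alpha` gate (FIBER uses learnable gating on the cross-attention branch) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # Scaled dot-product attention: each query attends over the
    # other modality's features (used here as both keys and values).
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

def fused_backbone_layer(img_feats, txt_feats, alpha=0.5):
    # Fusion in the backbone: image features attend to text features
    # and vice versa; the gated result is added residually, so each
    # stream keeps its shape and can feed the next backbone layer.
    d = img_feats.shape[-1]
    img_out = img_feats + alpha * cross_attention(img_feats, txt_feats, d)
    txt_out = txt_feats + alpha * cross_attention(txt_feats, img_feats, d)
    return img_out, txt_out

# Toy example: 4 image patches and 3 text tokens, both 8-dimensional.
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 8))
txt = rng.standard_normal((3, 8))
img_fused, txt_fused = fused_backbone_layer(img, txt)
print(img_fused.shape, txt_fused.shape)  # shapes preserved: (4, 8) (3, 8)
```

Because the residual update preserves each stream's shape, such a fused layer can replace a plain layer inside an existing image or text backbone, which is what lets the paper avoid dedicated post-hoc fusion transformer layers.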

Results

Task             | Dataset                       | Metric                  | Value | Model
Phrase Grounding | Flickr30k Entities Dev        | R@1                     | 87.1  | FIBER-B
Phrase Grounding | Flickr30k Entities Dev        | R@5                     | 96.1  | FIBER-B
Phrase Grounding | Flickr30k Entities Dev        | R@10                    | 97.4  | FIBER-B
Phrase Grounding | Flickr30k Entities Test       | R@1                     | 87.4  | FIBER-B
Phrase Grounding | Flickr30k Entities Test       | R@5                     | 96.4  | FIBER-B
Phrase Grounding | Flickr30k Entities Test       | R@10                    | 97.6  | FIBER-B
Object Detection | COCO-O                        | Average mAP             | 33.7  | FIBER-B (Swin-B)
Object Detection | COCO-O                        | Effective Robustness    | 11.43 | FIBER-B (Swin-B)
Object Detection | Description Detection Dataset | Intra-scenario FULL mAP | 22.7  | FIBER-B
Object Detection | Description Detection Dataset | Intra-scenario PRES mAP | 21.5  | FIBER-B
Object Detection | Description Detection Dataset | Intra-scenario ABS mAP  | 26    | FIBER-B

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)