Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, Lijuan Wang

Published: 2022-06-15 · NeurIPS 2022
Tasks: Question Answering · Described Object Detection · Image-text Retrieval · Text Retrieval · Referring Expression Comprehension · Image Captioning · Visual Reasoning · Phrase Grounding · Visual Question Answering (VQA) · Object Detection · Visual Question Answering
Paper · PDF · Code (official)

Abstract

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is available at https://github.com/microsoft/FIBER.
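The abstract's central architectural idea is to push fusion into the backbones: each modality's features are updated with cross-attention over the other modality's features, inside the backbone layers, rather than in separate fusion layers stacked on top. The following is a minimal, single-head numpy sketch of that idea, not the actual FIBER implementation; the function names and the `alpha` gate (FIBER uses learnable gating on the cross-attention branch) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # Scaled dot-product attention: each query attends over the
    # other modality's features (used here as both keys and values).
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

def fused_backbone_layer(img_feats, txt_feats, alpha=0.5):
    # Fusion in the backbone: image features attend to text features
    # and vice versa; the gated result is added residually, so each
    # stream keeps its shape and can feed the next backbone layer.
    d = img_feats.shape[-1]
    img_out = img_feats + alpha * cross_attention(img_feats, txt_feats, d)
    txt_out = txt_feats + alpha * cross_attention(txt_feats, img_feats, d)
    return img_out, txt_out

# Toy example: 4 image patches and 3 text tokens, both 8-dimensional.
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 8))
txt = rng.standard_normal((3, 8))
img_fused, txt_fused = fused_backbone_layer(img, txt)
print(img_fused.shape, txt_fused.shape)  # shapes preserved: (4, 8) (3, 8)
```

Because the residual update preserves each stream's shape, such a fused layer can replace a plain layer inside an existing image or text backbone, which is what lets the paper avoid dedicated post-hoc fusion transformer layers.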

Results

Task             | Dataset                       | Metric                  | Value | Model
Phrase Grounding | Flickr30k Entities Dev        | R@1                     | 87.1  | FIBER-B
Phrase Grounding | Flickr30k Entities Dev        | R@5                     | 96.1  | FIBER-B
Phrase Grounding | Flickr30k Entities Dev        | R@10                    | 97.4  | FIBER-B
Phrase Grounding | Flickr30k Entities Test       | R@1                     | 87.4  | FIBER-B
Phrase Grounding | Flickr30k Entities Test       | R@5                     | 96.4  | FIBER-B
Phrase Grounding | Flickr30k Entities Test       | R@10                    | 97.6  | FIBER-B
Object Detection | COCO-O                        | Average mAP             | 33.7  | FIBER-B (Swin-B)
Object Detection | COCO-O                        | Effective Robustness    | 11.43 | FIBER-B (Swin-B)
Object Detection | Description Detection Dataset | Intra-scenario FULL mAP | 22.7  | FIBER-B
Object Detection | Description Detection Dataset | Intra-scenario PRES mAP | 21.5  | FIBER-B
Object Detection | Description Detection Dataset | Intra-scenario ABS mAP  | 26    | FIBER-B

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)