
An Empirical Study of Training End-to-End Vision-and-Language Transformers

Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, Michael Zeng

2021-11-03 · CVPR 2022 · Cross-Modal Retrieval · Visual Reasoning · Visual Question Answering (VQA)
Paper · PDF · Code (official)

Abstract

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer. METER achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%. Notably, when further scaled up, our best VQA model achieves an accuracy of 80.54%. Code and pre-trained models are released at https://github.com/zdou0830/METER.
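The abstract contrasts two fusion designs: merged attention, where text tokens and image patches are concatenated and processed by one shared self-attention, and co-attention, where each modality cross-attends to the other. The following is a minimal NumPy sketch of that distinction, not the paper's actual implementation (METER uses full transformer blocks with learned projections and multiple heads; the function names and tiny dimensions here are illustrative only):

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def merged_attention_fusion(text, image):
    # Merged attention: concatenate both modalities into one sequence
    # and apply a single shared self-attention over all tokens.
    x = np.concatenate([text, image], axis=0)
    return attention(x, x, x)

def co_attention_fusion(text, image):
    # Co-attention: two separate streams, each cross-attending to the
    # other modality (text queries image tokens, and vice versa).
    text_out = attention(text, image, image)
    image_out = attention(image, text, text)
    return text_out, image_out

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))    # 4 text tokens, hidden dim 8
image = rng.normal(size=(6, 8))   # 6 image patches, hidden dim 8

merged = merged_attention_fusion(text, image)     # shape (10, 8)
t2i, i2t = co_attention_fusion(text, image)       # shapes (4, 8), (6, 8)
print(merged.shape, t2i.shape, i2t.shape)
```

The design trade-off the paper studies follows directly from the shapes: merged attention keeps one parameter set over a longer joint sequence, while co-attention keeps the modalities in separate streams with dedicated cross-modal parameters.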

Results

Task                                     Dataset     Metric               Value    Model
Image Retrieval with Multi-Modal Query   COCO 2014   Image-to-text R@1    76.16    METER
Image Retrieval with Multi-Modal Query   COCO 2014   Image-to-text R@5    93.16    METER
Image Retrieval with Multi-Modal Query   COCO 2014   Image-to-text R@10   96.82    METER
Image Retrieval with Multi-Modal Query   COCO 2014   Text-to-image R@1    57.08    METER
Image Retrieval with Multi-Modal Query   COCO 2014   Text-to-image R@5    82.66    METER
Image Retrieval with Multi-Modal Query   COCO 2014   Text-to-image R@10   90.07    METER
Cross-Modal Information Retrieval        COCO 2014   Image-to-text R@1    76.16    METER
Cross-Modal Information Retrieval        COCO 2014   Image-to-text R@5    93.16    METER
Cross-Modal Information Retrieval        COCO 2014   Image-to-text R@10   96.82    METER
Cross-Modal Information Retrieval        COCO 2014   Text-to-image R@1    57.08    METER
Cross-Modal Information Retrieval        COCO 2014   Text-to-image R@5    82.66    METER
Cross-Modal Information Retrieval        COCO 2014   Text-to-image R@10   90.07    METER
Cross-Modal Retrieval                    COCO 2014   Image-to-text R@1    76.16    METER
Cross-Modal Retrieval                    COCO 2014   Image-to-text R@5    93.16    METER
Cross-Modal Retrieval                    COCO 2014   Image-to-text R@10   96.82    METER
Cross-Modal Retrieval                    COCO 2014   Text-to-image R@1    57.08    METER
Cross-Modal Retrieval                    COCO 2014   Text-to-image R@5    82.66    METER
Cross-Modal Retrieval                    COCO 2014   Text-to-image R@10   90.07    METER
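The R@k metric reported above is recall at rank k: the fraction of queries whose ground-truth match appears among the top-k retrieved candidates. A minimal sketch of how it is computed from a query-candidate similarity matrix (the function name and toy scores are illustrative, not from the paper):

```python
import numpy as np

def recall_at_k(sim, k):
    # sim[i, j]: similarity between query i and candidate j.
    # By convention here, the ground-truth match for query i is candidate i.
    ranks = np.argsort(-sim, axis=1)  # candidates sorted by descending score
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

# Toy 3-query example: queries 0 and 2 rank their match first;
# query 1's match (candidate 1) is ranked second.
sim = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.5, 0.8],
    [0.1, 0.2, 0.7],
])
print(recall_at_k(sim, 1))  # 2 of 3 queries rank their match first
print(recall_at_k(sim, 2))  # all matches fall within the top 2
```

In the results table, image-to-text and text-to-image directions are scored separately, which is why each appears with its own R@1/R@5/R@10 row.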

Related Papers

LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
PyVision: Agentic Vision with Dynamic Tooling (2025-07-10)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)