
An Empirical Study of Training End-to-End Vision-and-Language Transformers

Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, Michael Zeng

2021-11-03 · CVPR 2022 · Cross-Modal Retrieval · Visual Reasoning · Visual Question Answering (VQA)
Paper · PDF · Code (official)

Abstract

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer. METER achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%. Notably, when further scaled up, our best VQA model achieves an accuracy of 80.54%. Code and pre-trained models are released at https://github.com/zdou0830/METER.
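The abstract contrasts two fusion designs: merged attention, where text tokens and image patches are concatenated and processed by one shared self-attention, and co-attention, where each modality cross-attends to the other. The following is a minimal NumPy sketch of that distinction, not the paper's actual implementation (METER uses full transformer blocks with learned projections and multiple heads; the function names and tiny dimensions here are illustrative only):

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def merged_attention_fusion(text, image):
    # Merged attention: concatenate both modalities into one sequence
    # and apply a single shared self-attention over all tokens.
    x = np.concatenate([text, image], axis=0)
    return attention(x, x, x)

def co_attention_fusion(text, image):
    # Co-attention: two separate streams, each cross-attending to the
    # other modality (text queries image tokens, and vice versa).
    text_out = attention(text, image, image)
    image_out = attention(image, text, text)
    return text_out, image_out

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))    # 4 text tokens, hidden dim 8
image = rng.normal(size=(6, 8))   # 6 image patches, hidden dim 8

merged = merged_attention_fusion(text, image)     # shape (10, 8)
t2i, i2t = co_attention_fusion(text, image)       # shapes (4, 8), (6, 8)
print(merged.shape, t2i.shape, i2t.shape)
```

The design trade-off the paper studies follows directly from the shapes: merged attention keeps one parameter set over a longer joint sequence, while co-attention keeps the modalities in separate streams with dedicated cross-modal parameters.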

Results

Task                                     Dataset     Metric               Value    Model
Image Retrieval with Multi-Modal Query   COCO 2014   Image-to-text R@1    76.16    METER
Image Retrieval with Multi-Modal Query   COCO 2014   Image-to-text R@5    93.16    METER
Image Retrieval with Multi-Modal Query   COCO 2014   Image-to-text R@10   96.82    METER
Image Retrieval with Multi-Modal Query   COCO 2014   Text-to-image R@1    57.08    METER
Image Retrieval with Multi-Modal Query   COCO 2014   Text-to-image R@5    82.66    METER
Image Retrieval with Multi-Modal Query   COCO 2014   Text-to-image R@10   90.07    METER
Cross-Modal Information Retrieval        COCO 2014   Image-to-text R@1    76.16    METER
Cross-Modal Information Retrieval        COCO 2014   Image-to-text R@5    93.16    METER
Cross-Modal Information Retrieval        COCO 2014   Image-to-text R@10   96.82    METER
Cross-Modal Information Retrieval        COCO 2014   Text-to-image R@1    57.08    METER
Cross-Modal Information Retrieval        COCO 2014   Text-to-image R@5    82.66    METER
Cross-Modal Information Retrieval        COCO 2014   Text-to-image R@10   90.07    METER
Cross-Modal Retrieval                    COCO 2014   Image-to-text R@1    76.16    METER
Cross-Modal Retrieval                    COCO 2014   Image-to-text R@5    93.16    METER
Cross-Modal Retrieval                    COCO 2014   Image-to-text R@10   96.82    METER
Cross-Modal Retrieval                    COCO 2014   Text-to-image R@1    57.08    METER
Cross-Modal Retrieval                    COCO 2014   Text-to-image R@5    82.66    METER
Cross-Modal Retrieval                    COCO 2014   Text-to-image R@10   90.07    METER
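The R@k metric reported above is recall at rank k: the fraction of queries whose ground-truth match appears among the top-k retrieved candidates. A minimal sketch of how it is computed from a query-candidate similarity matrix (the function name and toy scores are illustrative, not from the paper):

```python
import numpy as np

def recall_at_k(sim, k):
    # sim[i, j]: similarity between query i and candidate j.
    # By convention here, the ground-truth match for query i is candidate i.
    ranks = np.argsort(-sim, axis=1)  # candidates sorted by descending score
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

# Toy 3-query example: queries 0 and 2 rank their match first;
# query 1's match (candidate 1) is ranked second.
sim = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.5, 0.8],
    [0.1, 0.2, 0.7],
])
print(recall_at_k(sim, 1))  # 2 of 3 queries rank their match first
print(recall_at_k(sim, 2))  # all matches fall within the top 2
```

In the results table, image-to-text and text-to-image directions are scored separately, which is why each appears with its own R@1/R@5/R@10 row.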

Related Papers

LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
PyVision: Agentic Vision with Dynamic Tooling (2025-07-10)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)