Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Hao Tan, Mohit Bansal

2019-08-20 · IJCNLP 2019

Tasks: Question Answering · Masked Language Modeling · Visual Reasoning · Visual Question Answering (VQA) · Language Modelling · Visual Question Answering
Paper · PDF · Code (official)

Abstract

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help in learning both intra-modality and cross-modality relationships. After fine-tuning from our pre-trained parameters, our model achieves the state-of-the-art results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our pre-trained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR2, and improve the previous best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel model components and pre-training strategies significantly contribute to our strong results; and also present several attention visualizations for the different encoders. Code and pre-trained models publicly available at: https://github.com/airsplay/lxmert
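As a rough illustration of the three-encoder layout the abstract describes, the sketch below wires up a language encoder, an object-relationship encoder, and a cross-modality encoder in NumPy. The layer counts (9 language, 5 cross, 5 object) are read off the paper's LXR955 model tag; everything else — the class name, the single-head attention stand-in, the toy hidden size — is illustrative and not the paper's implementation, which uses full Transformer layers with feed-forward sublayers, layer norm, and pre-training heads.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # toy hidden size (the paper uses 768)

def make_weights():
    return {k: rng.normal(scale=0.1, size=(D, D)) for k in ("q", "k", "v")}

def attend(queries, keys_values, w):
    # Single-head scaled dot-product attention with a residual connection;
    # a toy stand-in for a full Transformer layer (no multi-head split,
    # no feed-forward sublayer, no LayerNorm).
    Q, K, V = queries @ w["q"], keys_values @ w["k"], keys_values @ w["v"]
    scores = Q @ K.T / np.sqrt(D)
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return queries + p @ V

class LXMERTSketch:
    # Hypothetical class, not from the released code. Layer counts follow
    # the LXR955 naming: 9 language, 5 cross-modality, 5 object layers.
    def __init__(self, n_lang=9, n_cross=5, n_obj=5):
        self.lang = [make_weights() for _ in range(n_lang)]
        self.obj = [make_weights() for _ in range(n_obj)]
        # each cross layer: one cross-attention per stream, then self-attention
        self.cross = [{s: make_weights() for s in ("l2v", "v2l", "l", "v")}
                      for _ in range(n_cross)]

    def forward(self, word_emb, region_feats):
        l, v = word_emb, region_feats
        for w in self.lang:                 # language encoder
            l = attend(l, l, w)
        for w in self.obj:                  # object-relationship encoder
            v = attend(v, v, w)
        for w in self.cross:                # cross-modality encoder
            l_new = attend(l, v, w["l2v"])  # language attends to vision
            v_new = attend(v, l, w["v2l"])  # vision attends to language
            l = attend(l_new, l_new, w["l"])
            v = attend(v_new, v_new, w["v"])
        return l, v                         # per-token cross-modal features

model = LXMERTSketch()
l_out, v_out = model.forward(rng.normal(size=(12, D)),   # 12 word tokens
                             rng.normal(size=(36, D)))   # 36 detected regions
```

The point of the two-stream-then-cross design is that each modality is first contextualized on its own before the cross-modality encoder lets the streams exchange information; the five pre-training tasks then supervise heads attached on top of these outputs.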

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | A-OKVQA | DA VQA Score | 25.9 | LXMERT |
| Visual Question Answering (VQA) | A-OKVQA | MC Accuracy | 41.6 | LXMERT |
| Visual Question Answering (VQA) | GQA Test2019 | Accuracy | 62.71 | LXR955, Ensemble |
| Visual Question Answering (VQA) | GQA Test2019 | Binary | 79.79 | LXR955, Ensemble |
| Visual Question Answering (VQA) | GQA Test2019 | Consistency | 93.1 | LXR955, Ensemble |
| Visual Question Answering (VQA) | GQA Test2019 | Distribution | 6.42 | LXR955, Ensemble |
| Visual Question Answering (VQA) | GQA Test2019 | Open | 47.64 | LXR955, Ensemble |
| Visual Question Answering (VQA) | GQA Test2019 | Plausibility | 85.21 | LXR955, Ensemble |
| Visual Question Answering (VQA) | GQA Test2019 | Validity | 96.36 | LXR955, Ensemble |
| Visual Question Answering (VQA) | GQA Test2019 | Accuracy | 60.33 | LXR955, Single Model |
| Visual Question Answering (VQA) | GQA Test2019 | Binary | 77.16 | LXR955, Single Model |
| Visual Question Answering (VQA) | GQA Test2019 | Consistency | 89.59 | LXR955, Single Model |
| Visual Question Answering (VQA) | GQA Test2019 | Distribution | 5.69 | LXR955, Single Model |
| Visual Question Answering (VQA) | GQA Test2019 | Open | 45.47 | LXR955, Single Model |
| Visual Question Answering (VQA) | GQA Test2019 | Plausibility | 84.53 | LXR955, Single Model |
| Visual Question Answering (VQA) | GQA Test2019 | Validity | 96.35 | LXR955, Single Model |
| Visual Question Answering (VQA) | GQA test-std | Accuracy | 60.3 | LXMERT |
| Visual Question Answering (VQA) | VizWiz 2018 | number | 24.76 | LXR955, No Ensemble |
| Visual Question Answering (VQA) | VizWiz 2018 | other | 39 | LXR955, No Ensemble |
| Visual Question Answering (VQA) | VizWiz 2018 | overall | 55.4 | LXR955, No Ensemble |
| Visual Question Answering (VQA) | VizWiz 2018 | unanswerable | 82.26 | LXR955, No Ensemble |
| Visual Question Answering (VQA) | VizWiz 2018 | yes/no | 74 | LXR955, No Ensemble |
| Visual Question Answering (VQA) | GQA test-dev | Accuracy | 60 | LXMERT (Pre-train + scratch) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 69.9 | LXMERT (Pre-train + scratch) |
| Visual Question Answering (VQA) | VQA v2 test-std | overall | 72.5 | LXMERT |
| Visual Reasoning | NLVR2 Dev | Accuracy | 74.9 | LXMERT (Pre-train + scratch) |
| Visual Reasoning | NLVR2 Test | Accuracy | 76.2 | LXMERT |
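The "overall", "yes/no", "number", and "other" entries on the VQA v2 and VizWiz rows use the standard VQA soft-accuracy metric, in which an answer counts as fully correct if at least 3 of the 10 human annotators gave it. A minimal sketch of that formula (omitting the official evaluator's answer normalization and its averaging over 9-annotator subsets):

```python
def vqa_accuracy(pred, human_answers):
    # VQA-style soft accuracy: min(#annotators who gave this answer / 3, 1).
    # `human_answers` is the list of (typically 10) annotator answers.
    matches = sum(a == pred for a in human_answers)
    return min(matches / 3.0, 1.0)

# A prediction matching 4 of 10 annotators scores 1.0;
# matching only 2 of 10 scores 2/3.
full = vqa_accuracy("yes", ["yes"] * 4 + ["no"] * 6)
partial = vqa_accuracy("yes", ["yes"] * 2 + ["no"] * 8)
```

A dataset-level score is then the mean of this per-question accuracy, reported as a percentage; the GQA columns (Binary, Open, Consistency, etc.) are that benchmark's own metrics and are not computed this way.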

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)