LXMERT

Learning Cross-Modality Encoder Representations from Transformers

Computer Vision · Introduced 2019 · 40 papers
Source Paper: Tan & Bansal, EMNLP 2019

Description

LXMERT is a model for learning vision-and-language cross-modality representations. It is a Transformer built from three encoders: an object-relationship encoder, a language encoder, and a cross-modality encoder. The model takes two inputs: an image and its related sentence. Each image is represented as a sequence of objects, and each sentence as a sequence of words. By combining self-attention and cross-attention layers, the model generates language representations, image representations, and cross-modality representations from the input. The model is pre-trained on image-sentence pairs via five pre-training tasks: masked language modeling, masked object prediction via RoI-feature regression, masked object prediction via detected-label classification, cross-modality matching, and image question answering. These tasks help the model learn both intra-modality and cross-modality relationships.
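For intuition, here is a minimal sketch of running this two-stream architecture with the Hugging Face transformers implementation (LxmertTokenizer and LxmertModel with the unc-nlp/lxmert-base-uncased checkpoint). Note that LXMERT does not consume raw pixels: the random visual_feats and visual_pos tensors below are stand-ins for the 2048-d Faster R-CNN RoI features and 4-d normalized box coordinates that a detector would normally supply.

```python
import torch
from transformers import LxmertTokenizer, LxmertModel

# Pre-trained LXMERT: language, object-relationship, and cross-modality encoders.
tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# Language input: the sentence related to the image, as a sequence of word tokens.
text_inputs = tokenizer("A cat is sitting on the couch.", return_tensors="pt")

# Vision input: the image as a sequence of objects. Random tensors stand in for
# 36 detected objects with 2048-d RoI features and 4-d normalized box positions.
visual_feats = torch.rand(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)

with torch.no_grad():
    outputs = model(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
        visual_feats=visual_feats,
        visual_pos=visual_pos,
    )

print(outputs.language_output.shape)  # per-token language representations
print(outputs.vision_output.shape)    # per-object image representations
print(outputs.pooled_output.shape)    # single cross-modality representation
```

The pre-training objectives listed above correspond roughly to the heads bundled in the library's LxmertForPreTraining class, which adds masked-language-modeling, object-prediction, cross-modality-matching, and question-answering heads on top of the same encoder stack.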

Papers Using This Method

Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns (2024-06-13)
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking (2024-01-29)
LXMERT Model Compression for Visual Question Answering (2023-10-23)
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models (2023-08-18)
Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving (2023-07-18)
An Empirical Study on the Language Modal in Visual Question Answering (2023-05-17)
Probing the Role of Positional Information in Vision-Language Models (2023-05-17)
Controlling for Stereotypes in Multimodal Language Model Evaluation (2023-02-03)
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks (2022-12-15)
Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference (2022-11-21)
Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering (2022-10-26)
Probing Cross-modal Semantics Alignment Capability from the Textual Perspective (2022-10-18)
Generative Bias for Robust Visual Question Answering (2022-08-01)
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations (2022-07-01)
Visual Spatial Reasoning (2022-04-30)
Visio-Linguistic Brain Encoding (2022-04-18)
SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering (2022-04-05)
Exploring Multi-Modal Representations for Ambiguity Detection & Coreference Resolution in the SIMMC 2.0 Challenge (2022-02-25)
Probing the Role of Positional Information in Vision-Language Models (2022-01-16)
Multimodal Learning: Are Captions All You Need? (2021-11-16)