LXMERT

Learning Cross-Modality Encoder Representations from Transformers

Computer Vision · Introduced 2019 · 40 papers
Source Paper: Tan & Bansal, EMNLP 2019

Description

LXMERT is a model for learning vision-and-language cross-modality representations. It is a Transformer built from three encoders: an object-relationship encoder, a language encoder, and a cross-modality encoder. The model takes two inputs: an image and its related sentence. Each image is represented as a sequence of objects, and each sentence as a sequence of words. By combining self-attention and cross-attention layers, the model generates language representations, image representations, and cross-modality representations from the input. The model is pre-trained on image-sentence pairs via five pre-training tasks: masked language modeling, masked object prediction via RoI-feature regression, masked object prediction via detected-label classification, cross-modality matching, and image question answering. These tasks help the model learn both intra-modality and cross-modality relationships.
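For intuition, here is a minimal sketch of running this two-stream architecture with the Hugging Face transformers implementation (LxmertTokenizer and LxmertModel with the unc-nlp/lxmert-base-uncased checkpoint). Note that LXMERT does not consume raw pixels: the random visual_feats and visual_pos tensors below are stand-ins for the 2048-d Faster R-CNN RoI features and 4-d normalized box coordinates that a detector would normally supply.

```python
import torch
from transformers import LxmertTokenizer, LxmertModel

# Pre-trained LXMERT: language, object-relationship, and cross-modality encoders.
tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# Language input: the sentence related to the image, as a sequence of word tokens.
text_inputs = tokenizer("A cat is sitting on the couch.", return_tensors="pt")

# Vision input: the image as a sequence of objects. Random tensors stand in for
# 36 detected objects with 2048-d RoI features and 4-d normalized box positions.
visual_feats = torch.rand(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)

with torch.no_grad():
    outputs = model(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
        visual_feats=visual_feats,
        visual_pos=visual_pos,
    )

print(outputs.language_output.shape)  # per-token language representations
print(outputs.vision_output.shape)    # per-object image representations
print(outputs.pooled_output.shape)    # single cross-modality representation
```

The pre-training objectives listed above correspond roughly to the heads bundled in the library's LxmertForPreTraining class, which adds masked-language-modeling, object-prediction, cross-modality-matching, and question-answering heads on top of the same encoder stack.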

Papers Using This Method

Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns (2024-06-13)
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking (2024-01-29)
LXMERT Model Compression for Visual Question Answering (2023-10-23)
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models (2023-08-18)
Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving (2023-07-18)
An Empirical Study on the Language Modal in Visual Question Answering (2023-05-17)
Probing the Role of Positional Information in Vision-Language Models (2023-05-17)
Controlling for Stereotypes in Multimodal Language Model Evaluation (2023-02-03)
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks (2022-12-15)
Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference (2022-11-21)
Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering (2022-10-26)
Probing Cross-modal Semantics Alignment Capability from the Textual Perspective (2022-10-18)
Generative Bias for Robust Visual Question Answering (2022-08-01)
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations (2022-07-01)
Visual Spatial Reasoning (2022-04-30)
Visio-Linguistic Brain Encoding (2022-04-18)
SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering (2022-04-05)
Exploring Multi-Modal Representations for Ambiguity Detection & Coreference Resolution in the SIMMC 2.0 Challenge (2022-02-25)
Probing the Role of Positional Information in Vision-Language Models (2022-01-16)
Multimodal Learning: Are Captions All You Need? (2021-11-16)