ALBEF

Computer Vision • Introduced 2021 • 17 papers

Description

ALBEF introduces a contrastive loss to align image and text representations before fusing them through cross-modal attention, enabling more grounded vision-and-language representation learning without requiring bounding box annotations. The model consists of an image encoder, a text encoder, and a multimodal encoder. An image-text contrastive (ITC) loss aligns the unimodal representations of an image-text pair before fusion, while an image-text matching (ITM) loss and a masked language modeling (MLM) loss are applied to learn multimodal interactions between image and text. In addition, momentum distillation generates pseudo-targets from a momentum (exponential-moving-average) version of the model, which improves learning from noisy web data.
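To make the alignment step concrete, below is a minimal PyTorch-style sketch of an image-text contrastive loss with momentum-distilled soft targets. The function name, the `alpha` mixing weight, and the temperature default are illustrative assumptions for this sketch, not ALBEF's exact implementation.

```python
import torch
import torch.nn.functional as F

def itc_loss_with_distillation(img_feat, txt_feat, img_feat_m, txt_feat_m,
                               temp=0.07, alpha=0.4):
    """Image-text contrastive loss with momentum-distilled pseudo-targets.

    img_feat / txt_feat: L2-normalized embeddings from the online encoders,
    shape (batch, dim). img_feat_m / txt_feat_m come from the momentum (EMA)
    encoders and supply soft pseudo-targets; `alpha` blends them with the
    one-hot targets. Hyper-parameter values here are assumptions.
    """
    # Similarity logits for both retrieval directions.
    sim_i2t = img_feat @ txt_feat.t() / temp
    sim_t2i = txt_feat @ img_feat.t() / temp

    with torch.no_grad():
        # Soft pseudo-targets from the momentum encoders (no gradient).
        sim_i2t_m = img_feat_m @ txt_feat_m.t() / temp
        sim_t2i_m = txt_feat_m @ img_feat_m.t() / temp
        hard = torch.eye(img_feat.size(0), device=img_feat.device)
        tgt_i2t = alpha * F.softmax(sim_i2t_m, dim=1) + (1 - alpha) * hard
        tgt_t2i = alpha * F.softmax(sim_t2i_m, dim=1) + (1 - alpha) * hard

    # Cross-entropy against the blended targets, averaged over directions.
    loss_i2t = -(F.log_softmax(sim_i2t, dim=1) * tgt_i2t).sum(1).mean()
    loss_t2i = -(F.log_softmax(sim_t2i, dim=1) * tgt_t2i).sum(1).mean()
    return (loss_i2t + loss_t2i) / 2
```

In training, this ITC term would be summed with the ITM and MLM losses to form the full pre-training objective; the soft targets let the model learn from plausible negatives in noisy web pairs rather than forcing a strict one-hot match.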

Papers Using This Method

- Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses (2024-12-11)
- Nearest Neighbor Normalization Improves Multimodal Retrieval (2024-10-31)
- Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM (2024-04-29)
- Learning from Synthetic Data for Visual Grounding (2024-03-20)
- LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrieval (2024-03-16)
- Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction (2024-03-16)
- Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (2023-07-26)
- RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search (2023-05-23)
- MultiModal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision Language Models (2023-03-16)
- Is Multimodal Vision Supervision Beneficial to Language? (2023-02-10)
- MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks (2022-12-15)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training (2022-11-20)
- GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training (2022-08-08)
- ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding (2022-08-05)
- VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations (2022-07-01)
- MixGen: A New Multi-Modal Data Augmentation (2022-06-16)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation (2021-07-16)