Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, Steven Hoi
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method requires neither bounding box annotations nor high-resolution images. To improve learning from noisy web data, we propose momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders-of-magnitude larger datasets. On VQA and NLVR$^2$, ALBEF achieves absolute improvements of 2.37% and 3.84% over the state-of-the-art, while enjoying faster inference speed. Code and pre-trained models are available at https://github.com/salesforce/ALBEF/.
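The align-before-fuse idea can be summarized in a few lines of PyTorch. The sketch below is illustrative rather than the paper's implementation: it computes an in-batch image-text contrastive (ITC) loss on the [CLS] embeddings of the two unimodal encoders; ALBEF additionally draws negatives from queues of momentum features, which are omitted here, and the names (`ITCAlign`, `image_cls`, `text_cls`) are assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ITCAlign(nn.Module):
    """Image-text contrastive (ITC) alignment before fusion: a minimal sketch.

    `image_cls` / `text_cls` are the [CLS] features from the unimodal
    encoders (a ViT for images, the early BERT layers for text in ALBEF).
    This version contrasts only in-batch pairs; ALBEF also uses queues of
    momentum features as extra negatives, omitted here for brevity.
    """

    def __init__(self, dim: int = 768, embed_dim: int = 256, temp: float = 0.07):
        super().__init__()
        self.vision_proj = nn.Linear(dim, embed_dim)   # image [CLS] -> shared space
        self.text_proj = nn.Linear(dim, embed_dim)     # text  [CLS] -> shared space
        self.temp = nn.Parameter(torch.tensor(temp))   # learnable temperature

    def forward(self, image_cls: torch.Tensor, text_cls: torch.Tensor) -> torch.Tensor:
        # Project both modalities into a shared, unit-normalized embedding space.
        img = F.normalize(self.vision_proj(image_cls), dim=-1)  # (B, E)
        txt = F.normalize(self.text_proj(text_cls), dim=-1)     # (B, E)

        # Similarity of every image to every text in the batch.
        sim_i2t = img @ txt.t() / self.temp  # (B, B)
        sim_t2i = sim_i2t.t()

        # Matched pairs lie on the diagonal -> symmetric InfoNCE loss.
        targets = torch.arange(img.size(0), device=img.device)
        return (F.cross_entropy(sim_i2t, targets) +
                F.cross_entropy(sim_t2i, targets)) / 2
```

Only after this alignment are the two token sequences fused by the multimodal encoder, whose cross-attention can then relate word tokens to visual tokens that already live in a shared semantic space.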
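Momentum distillation admits a similarly compact sketch. Assuming a `momentum_model` that is a parameter-for-parameter copy of the online model, the EMA update and the mixed hard/soft loss below follow the paper's description (the EMA coefficient 0.995 and distillation weight α = 0.4 are the paper's reported values); the function names are illustrative, not the repo's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(model, momentum_model, m: float = 0.995):
    """Keep the momentum (teacher) model as an exponential moving average
    of the online model; 0.995 is the coefficient reported in the paper."""
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)

def momentum_distill_loss(student_logits, teacher_logits, targets, alpha: float = 0.4):
    """Mix the hard one-hot cross-entropy with a soft term that follows the
    momentum model's pseudo-targets (alpha = 0.4 as in the paper)."""
    hard = F.cross_entropy(student_logits, targets)
    # Soft cross-entropy against the teacher's full distribution:
    # equivalent, up to a constant, to KL(teacher || student).
    soft = -(F.softmax(teacher_logits, dim=-1) *
             F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    return (1.0 - alpha) * hard + alpha * soft
```

The soft targets let the model learn from plausible captions that differ from the (often noisy) web annotation, instead of penalizing every mismatch with the one-hot label.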
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 75.84 | ALBEF (14M) |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy | 76.04 | ALBEF (14M) |
| Visual Reasoning | NLVR2 Dev | Accuracy | 83.14 | ALBEF (14M) |
| Visual Reasoning | NLVR2 Test | Accuracy | 82.55 | ALBEF (14M) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 77.6 | ALBEF |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 94.3 | ALBEF |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 97.2 | ALBEF |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 60.7 | ALBEF |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 84.3 | ALBEF |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 90.5 | ALBEF |
| Cross-Modal Retrieval | CommercialAdsDataset | ADD(S) AUC | 82.74 | ALBEF |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 90.5 | ALBEF |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 98.8 | ALBEF |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 99.7 | ALBEF |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 76.8 | ALBEF |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 93.7 | ALBEF |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 96.7 | ALBEF |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 68.7 | ALBEF |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 89.5 | ALBEF |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 94.7 | ALBEF |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 50.1 | ALBEF |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 76.4 | ALBEF |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 84.5 | ALBEF |
| Open Vocabulary Attribute Detection | OVAD-Box benchmark | mean average precision | 21 | ALBEF |
| Image-to-Text Retrieval | Flickr30k | Recall@1 | 95.9 | ALBEF |
| Image-to-Text Retrieval | Flickr30k | Recall@5 | 99.8 | ALBEF |
| Image-to-Text Retrieval | Flickr30k | Recall@10 | 100 | ALBEF |