Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei
A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
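
The routing idea behind the Multiway Transformer can be illustrated with a small sketch. It assumes each block shares one self-attention module across modalities and sends every token to a feed-forward "expert" chosen by that token's modality. The names `multiway_block`, `make_expert`, and the `"vision"`/`"language"` tags are hypothetical. Learned attention projections, multiple heads, and layer normalization are left out for brevity, so this is not the authors' implementation.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_self_attention(tokens):
    """Single-head attention over ALL tokens, regardless of modality.

    Illustrative only: queries, keys, and values are the raw token vectors.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def make_expert(d, hidden, rng):
    """Build a two-layer ReLU feed-forward network with random weights (hypothetical helper)."""
    W1 = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(d)]

    def ffn(x):
        h = [max(0.0, v) for v in matvec(W1, x)]
        return matvec(W2, h)

    return ffn


def multiway_block(tokens, modalities, experts):
    """Shared attention, then a modality-specific feed-forward expert per token."""
    attended = shared_self_attention(tokens)
    # Residual connection around the shared attention.
    hidden = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    out = []
    for h, modality in zip(hidden, modalities):
        f = experts[modality](h)  # route the token by its modality tag
        out.append([a + b for a, b in zip(h, f)])  # residual around the expert
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    d = 4
    experts = {"vision": make_expert(d, 8, rng), "language": make_expert(d, 8, rng)}
    tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]
    out = multiway_block(tokens, ["vision", "vision", "language"], experts)
    print(len(out), len(out[0]))
```

Because attention is shared, image and text tokens in the same sequence can attend to each other (deep fusion), while the per-modality experts keep modality-specific parameters.
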
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 84.19 | BEiT-3 |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (overall) | 84.03 | BEiT-3 |
| Visual Reasoning | NLVR2 Dev | Accuracy | 91.51 | BEiT-3 |
| Visual Reasoning | NLVR2 Test | Accuracy | 92.58 | BEiT-3 |
| Semantic Segmentation | ADE20K val | mIoU | 62.8 | BEiT-3 |
| Semantic Segmentation | ADE20K | Params (M) | 1900 | BEiT-3 |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 98 | BEiT-3 |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 100 | BEiT-3 |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | BEiT-3 |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 90.3 | BEiT-3 |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 98.7 | BEiT-3 |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99.5 | BEiT-3 |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 84.8 | BEiT-3 |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 96.5 | BEiT-3 |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 98.3 | BEiT-3 |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 67.2 | BEiT-3 |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 87.7 | BEiT-3 |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 92.8 | BEiT-3 |
| Cross-Modal Retrieval | Flickr30k (zero-shot) | Image-to-text R@1 | 94.9 | BEiT-3 |
| Cross-Modal Retrieval | Flickr30k (zero-shot) | Image-to-text R@5 | 99.9 | BEiT-3 |
| Cross-Modal Retrieval | Flickr30k (zero-shot) | Image-to-text R@10 | 100 | BEiT-3 |
| Cross-Modal Retrieval | Flickr30k (zero-shot) | Text-to-image R@1 | 81.5 | BEiT-3 |
| Cross-Modal Retrieval | Flickr30k (zero-shot) | Text-to-image R@5 | 95.6 | BEiT-3 |
| Cross-Modal Retrieval | Flickr30k (zero-shot) | Text-to-image R@10 | 97.8 | BEiT-3 |
| Object Detection | COCO test-dev | box mAP | 63.7 | BEiT-3 |
| Instance Segmentation | COCO test-dev | mask AP | 54.8 | BEiT-3 |
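
The retrieval rows report Recall@K (R@1, R@5, R@10). As a minimal sketch of how such a number is commonly computed, the function below takes a query-by-candidate similarity matrix and counts a query as a hit when any correct candidate appears among its top K results. The function name `recall_at_k` and its arguments are hypothetical, and the benchmark's official evaluation script may differ in details such as tie handling.

```python
def recall_at_k(similarity, relevant, ks=(1, 5, 10)):
    """Compute Recall@K in percent.

    similarity[q][c] is the score of candidate c for query q.
    relevant[q] is the set of candidate indices that are correct for query q.
    """
    hits = {k: 0 for k in ks}
    for scores, gold in zip(similarity, relevant):
        # Rank candidates from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        for k in ks:
            if any(c in gold for c in ranked[:k]):
                hits[k] += 1
    n = len(similarity)
    return {k: 100.0 * hits[k] / n for k in ks}


if __name__ == "__main__":
    # Toy example: 3 queries, 3 candidates, one correct candidate per query.
    sim = [
        [0.9, 0.1, 0.3],
        [0.2, 0.4, 0.8],
        [0.5, 0.6, 0.1],
    ]
    gold = [{0}, {1}, {2}]
    print(recall_at_k(sim, gold, ks=(1, 2, 3)))
```

Since the top K results always contain the top K-1, Recall@K can never decrease as K grows, which gives a quick sanity check for any reported R@1, R@5, R@10 triple.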