| Rank | Model | Accuracy (%) | Extra Training Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | PaLI | 84.3 | No | PaLI: A Jointly-Scaled Multilingual Language-Ima... | 2022-09-14 | Code |
| 2 | BEiT-3 | 84.19 | No | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 3 | VLMo | 82.78 | No | VLMo: Unified Vision-Language Pre-Training with ... | 2021-11-03 | Code |
| 4 | ONE-PEACE | 82.6 | No | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 5 | mPLUG (Huge) | 82.43 | No | mPLUG: Effective and Efficient Vision-Language L... | 2022-05-24 | Code |
| 6 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | 82.3 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 7 | CoCa | 82.3 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 8 | CuMo-7B | 82.2 | Yes | CuMo: Scaling Multimodal LLM with Co-Upcycled Mi... | 2024-05-09 | Code |
| 9 | OFA | 82 | No | OFA: Unifying Architectures, Tasks, and Modaliti... | 2022-02-07 | Code |
| 10 | X2-VLM (large) | 81.9 | No | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 11 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | 81.74 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 12 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | 81.66 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 13 | MMU | 81.26 | No | Achieving Human Parity on Visual Question Answer... | 2021-11-17 | - |
| 14 | Lyrics | 81.2 | No | Lyrics: Boosting Fine-grained Language-Vision Al... | 2023-12-08 | - |
| 15 | InternVL-C | 81.2 | No | InternVL: Scaling up Vision Foundation Models an... | 2023-12-21 | Code |
| 16 | mPLUG-2 | 81.11 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 17 | X2-VLM (base) | 80.4 | No | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 18 | XFM (base) | 80.4 | No | Toward Building General Foundation Models for La... | 2023-01-12 | Code |
| 19 | VAST | 80.23 | Yes | - | - | - |
| 20 | Florence | 80.16 | No | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 21 | SimVLM | 80.03 | No | SimVLM: Simple Visual Language Model Pretraining... | 2021-08-24 | Code |
| 22 | VALOR | 78.46 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 23 | Prismer | 78.43 | No | Prismer: A Vision-Language Model with Multi-Task... | 2023-03-04 | Code |
| 24 | X-VLM (base) | 78.22 | No | Multi-Grained Vision Language Pre-Training: Alig... | 2021-11-16 | Code |
| 25 | VK-OOD | 77.9 | No | - | - | Code |
| 26 | Aurora (ours, r=64) | 77.69 | No | - | - | - |
| 27 | VK-OOD | 76.8 | No | Differentiable Outlier Detection Enable Robust D... | 2023-02-11 | Code |
| 28 | ALBEF (14M) | 75.84 | No | Align before Fuse: Vision and Language Represent... | 2021-07-16 | Code |
| 29 | Oscar | 73.82 | No | Oscar: Object-Semantics Aligned Pre-training for... | 2020-04-13 | Code |
| 30 | UNITER (Large) | 73.24 | No | UNITER: UNiversal Image-TExt Representation Lear... | 2019-09-25 | Code |
| 31 | X-101 grid features + MCAN | 72.59 | No | In Defense of Grid Features for Visual Question ... | 2020-01-10 | Code |
| 32 | CFR | 72.5 | No | Coarse-to-Fine Reasoning for Visual Question Ans... | 2021-10-06 | Code |
| 33 | VL-BERT (Large) | 71.79 | No | VL-BERT: Pre-training of Generic Visual-Linguist... | 2019-08-22 | Code |
| 34 | ViLT-B/32 | 71.26 | No | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 35 | MCAN+VC | 71.21 | No | Visual Commonsense R-CNN | 2020-02-27 | Code |
| 36 | VL-BERT (Base) | 71.16 | No | VL-BERT: Pre-training of Generic Visual-Linguist... | 2019-08-22 | Code |
| 37 | VisualBERT | 70.8 | No | VisualBERT: A Simple and Performant Baseline for... | 2019-08-09 | Code |
| 38 | LXMERT (low-magnitude pruning) | 70.72 | No | LXMERT Model Compression for Visual Question Ans... | 2023-10-23 | Code |
| 39 | MCANed-6 | 70.63 | No | Deep Modular Co-Attention Networks for Visual Qu... | 2019-06-25 | Code |
| 40 | ViLBERT | 70.55 | No | ViLBERT: Pretraining Task-Agnostic Visiolinguist... | 2019-08-06 | Code |
| 41 | BAN+Glove+Counter | 70.04 | No | Bilinear Attention Networks | 2018-05-21 | Code |
| 42 | LXMERT (Pre-train + scratch) | 69.9 | No | LXMERT: Learning Cross-Modality Encoder Represen... | 2019-08-20 | Code |
| 43 | Image features from bottom-up attention (adaptive K, ensemble) | 69.87 | No | Tips and Tricks for Visual Question Answering: L... | 2017-08-09 | Code |
| 44 | Pythia v0.3 + LoRRA | 69.21 | No | Towards VQA Models That Can Read | 2019-04-18 | Code |
| 45 | DMN | 68.09 | No | Learning to Count Objects in Natural Images for ... | 2018-02-15 | Code |
| 46 | LaKo | 68.07 | No | LaKo: Knowledge-driven Visual Question Answering... | 2022-07-26 | Code |
| 47 | MuRel | 68.03 | No | MUREL: Multimodal Relational Reasoning for Visua... | 2019-02-25 | Code |
| 48 | BLOCK | 67.58 | No | BLOCK: Bilinear Superdiagonal Fusion for Visual ... | 2019-01-31 | Code |
| 49 | MUTAN | 67.42 | No | MUTAN: Multimodal Tucker Fusion for Visual Quest... | 2017-05-18 | Code |
| 50 | BAN2-CTI | 67.4 | No | Compact Trilinear Interaction for Visual Questio... | 2019-09-26 | Code |
| 51 | 2D continuous softmax | 65.96 | No | Sparse and Continuous Attention Mechanisms | 2020-06-12 | Code |
| 52 | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 65 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 53 | N2NMN (ResNet-152, policy search) | 64.9 | No | Learning to Reason: End-to-End Module Networks f... | 2017-04-18 | Code |
| 54 | PNP-VQA | 64.8 | No | Plug-and-Play VQA: Zero-shot VQA by Conjoining L... | 2022-10-17 | Code |
| 55 | MCB | 64.7 | No | Multimodal Compact Bilinear Pooling for Visual Q... | 2016-06-06 | Code |
| 56 | RUBi | 63.18 | No | RUBi: Reducing Unimodal Biases in Visual Questio... | 2019-06-24 | Code |
| 57 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 63 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 58 | BLIP-2 ViT-L FlanT5 XL (zero-shot) | 62.3 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 59 | Flamingo 80B | 56.3 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 60 | LocVLM-L | 56.2 | No | Learning to Localize Objects Improves Spatial Re... | 2024-04-11 | Code |
| 61 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 52.6 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 62 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 52.3 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 63 | Flamingo 9B | 51.8 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 64 | KOSMOS-1 1.6B (zero-shot) | 51 | No | - | - | - |
| 65 | BLIP-2 ViT-L OPT 2.7B (zero-shot) | 49.7 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 66 | Flamingo 3B | 49.2 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 67 | VLKD | 44.5 | No | - | - | - |