Metric: yes/no (higher is better)
| # | Model | yes/no | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | ONE-PEACE | 94.85 | No | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 2 | mPLUG-Huge | 94.83 | No | mPLUG: Effective and Efficient Vision-Language L... | 2022-05-24 | Code |
| 3 | VLMo | 94.68 | No | VLMo: Unified Vision-Language Pre-Training with ... | 2021-11-03 | Code |
| 4 | OFA | 94.66 | No | OFA: Unifying Architectures, Tasks, and Modaliti... | 2022-02-07 | Code |
| 5 | Prismer | 93.09 | No | Prismer: A Vision-Language Model with Multi-Task... | 2023-03-04 | Code |
| 6 | MSR + MS Cog. Svcs., X10 models | 92.38 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 7 | MSR + MS Cog. Svcs. | 92.04 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 8 | BGN, ensemble | 90.89 | No | Bilinear Graph Networks for Visual Question Answ... | 2019-07-23 | - |
| 9 | ERNIE-ViL-single model | 90.83 | No | ERNIE-ViL: Knowledge Enhanced Vision-Language Re... | 2020-06-30 | - |
| 10 | Single, w/o VLP | 89.46 | No | Deep Multimodal Neural Architecture Search | 2020-04-25 | Code |
| 11 | Single, w/o VLP | 89.18 | No | In Defense of Grid Features for Visual Question ... | 2020-01-10 | Code |