Metric: number (higher is better)
| # | Model | number | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | ONE-PEACE | 72.24 | No | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 2 | OFA | 71.44 | No | OFA: Unifying Architectures, Tasks, and Modaliti... | 2022-02-07 | Code |
| 3 | mPLUG-Huge | 69.82 | No | mPLUG: Effective and Efficient Vision-Language L... | 2022-05-24 | Code |
| 4 | VLMo | 67.26 | No | VLMo: Unified Vision-Language Pre-Training with ... | 2021-11-03 | Code |
| 5 | MSR + MS Cog. Svcs., X10 models | 62.55 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 6 | MSR + MS Cog. Svcs. | 61.5 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 7 | Prismer | 61.39 | No | Prismer: A Vision-Language Model with Multi-Task... | 2023-03-04 | Code |
| 8 | BGN, ensemble | 61.13 | No | Bilinear Graph Networks for Visual Question Answ... | 2019-07-23 | - |
| 9 | Single, w/o VLP | 58.62 | No | Deep Multimodal Neural Architecture Search | 2020-04-25 | Code |
| 10 | Single, w/o VLP | 58.01 | No | In Defense of Grid Features for Visual Question ... | 2020-01-10 | Code |
| 11 | ERNIE-ViL-single model | 56.79 | No | ERNIE-ViL: Knowledge Enhanced Vision-Language Re... | 2020-06-30 | - |