Metric: other (higher is better)
| # | Model | other | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | mPLUG-Huge | 77.02 | No | mPLUG: Effective and Efficient Vision-Language L... | 2022-05-24 | Code |
| 2 | ONE-PEACE | 74.15 | No | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 3 | OFA | 73.35 | No | OFA: Unifying Architectures, Tasks, and Modaliti... | 2022-02-07 | Code |
| 4 | VLMo | 72.87 | No | VLMo: Unified Vision-Language Pre-Training with ... | 2021-11-03 | Code |
| 5 | Prismer | 69.7 | No | Prismer: A Vision-Language Model with Multi-Task... | 2023-03-04 | Code |
| 6 | MSR + MS Cog. Svcs., X10 models | 67.87 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 7 | MSR + MS Cog. Svcs. | 66.68 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 8 | BGN, ensemble | 66.28 | No | Bilinear Graph Networks for Visual Question Answ... | 2019-07-23 | - |
| 9 | ERNIE-ViL-single model | 65.24 | No | ERNIE-ViL: Knowledge Enhanced Vision-Language Re... | 2020-06-30 | - |
| 10 | Single, w/o VLP | 64.77 | No | In Defense of Grid Features for Visual Question ... | 2020-01-10 | Code |
| 11 | Single, w/o VLP | 63.78 | No | Deep Multimodal Neural Architecture Search | 2020-04-25 | Code |