Metric: Avg. Accuracy (higher is better)
| # | Model | Avg. Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | MC-CoT F-Large | 94.88 | No | Boosting the Power of Small Multimodal Reasoning... | 2023-11-23 | Code |
| 2 | Honeybee | 94.39 | Yes | Honeybee: Locality-enhanced Projector for Multim... | 2023-12-11 | Code |
| 3 | LLaVA (+ GPT-4) | 92.53 | Yes | - | - | - |
| 4 | Multimodal CoT | 91.68 | No | Multimodal Chain-of-Thought Reasoning in Languag... | 2023-02-02 | Code |
| 5 | Chat-UniVi-13B | 90.99 | Yes | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 6 | GPT-3 - CoT (QCM→ALE, 2-shot) | 75.17 | No | Learn to Explain: Multimodal Reasoning via Thoug... | 2022-09-20 | Code |
| 7 | GPT-3 - CoT (QCM→AE, 2-shot) | 74.61 | No | Learn to Explain: Multimodal Reasoning via Thoug... | 2022-09-20 | Code |
| 8 | UnifiedQA-BASE - CoT (QCM→ALE) | 74.11 | No | Learn to Explain: Multimodal Reasoning via Thoug... | 2022-09-20 | Code |
| 9 | GPT-3 (QCM→A, 2-shot) | 73.97 | No | Learn to Explain: Multimodal Reasoning via Thoug... | 2022-09-20 | Code |
| 10 | Video-LaVIT | 70 | No | Video-LaVIT: Unified Video-Language Pre-training... | 2024-02-05 | Code |