Metric: Accuracy (higher is better)
| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | 82.3 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 2 | CoCa | 82.3 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 3 | OFA | 82 | No | OFA: Unifying Architectures, Tasks, and Modaliti... | 2022-02-07 | Code |
| 4 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | 81.74 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 5 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | 81.66 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 6 | mPLUG-2 | 81.11 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 7 | Florence | 80.16 | No | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 8 | Aurora (ours, r=64) | 77.69 | No | - | - | - |
| 9 | VK-OOD | 76.8 | No | Differentiable Outlier Detection Enable Robust D... | 2023-02-11 | Code |
| 10 | LXMERT (low-magnitude pruning) | 70.72 | No | LXMERT Model Compression for Visual Question Ans... | 2023-10-23 | Code |
| 11 | LocVLM-L | 56.2 | No | Learning to Localize Objects Improves Spatial Re... | 2024-04-11 | Code |