Metric: CIDEr (higher is better)
| # | Model | CIDEr | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 121.6 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 2 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 121.0 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 3 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 119.7 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 4 | LEMON_large | 113.4 | No | Scaling Up Vision-Language Pre-training for Imag... | 2021-11-24 | - |
| 5 | BLIP_ViT-L | 113.2 | No | BLIP: Bootstrapping Language-Image Pre-training ... | 2022-01-28 | Code |
| 6 | SimVLM | 112.2 | No | SimVLM: Simple Visual Language Model Pretraining... | 2021-08-24 | Code |
| 7 | BLIP_CapFilt-L | 109.6 | No | BLIP: Bootstrapping Language-Image Pre-training ... | 2022-01-28 | Code |
| 8 | OmniVL | 107.5 | No | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 9 | VinVL | 95.5 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 10 | Enc-Dec | 90.2 | No | Conceptual 12M: Pushing Web-Scale Image-Text Pre... | 2021-02-17 | Code |
| 11 | OSCAR | 80.9 | No | Oscar: Object-Semantics Aligned Pre-training for... | 2020-04-13 | Code |