Metric: CIDEr (higher is better)
| # | Model | CIDEr | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 124.8 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 2 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 124.4 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 3 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 123.4 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 4 | BLIP_ViT-L | 115.3 | No | BLIP: Bootstrapping Language-Image Pre-training ... | 2022-01-28 | Code |
| 5 | SimVLM | 115.2 | No | SimVLM: Simple Visual Language Model Pretraining... | 2021-08-24 | Code |
| 6 | BLIP_CapFilt-L | 111.5 | No | BLIP: Bootstrapping Language-Image Pre-training ... | 2022-01-28 | Code |
| 7 | LEMON_large | 111.3 | No | Scaling Up Vision-Language Pre-training for Imag... | 2021-11-24 | - |
| 8 | OmniVL | 106.3 | No | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 9 | Enc-Dec | 94.5 | No | Conceptual 12M: Pushing Web-Scale Image-Text Pre... | 2021-02-17 | Code |
| 10 | VinVL | 88.3 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |