Metric: CIDEr (higher is better)
| # | Model | CIDEr | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 123.7 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 2 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 123.7 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 3 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 123.0 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 4 | LEMON_large | 116.9 | No | Scaling Up Vision-Language Pre-training for Imag... | 2021-11-24 | - |
| 5 | BLIP_ViT-L | 114.9 | No | BLIP: Bootstrapping Language-Image Pre-training ... | 2022-01-28 | Code |
| 6 | SimVLM | 113.7 | No | SimVLM: Simple Visual Language Model Pretraining... | 2021-08-24 | Code |
| 7 | BLIP_CapFilt-L | 111.8 | No | BLIP: Bootstrapping Language-Image Pre-training ... | 2022-01-28 | Code |
| 8 | LEMON_base | 107.7 | No | Scaling Up Vision-Language Pre-training for Imag... | 2021-11-24 | - |
| 9 | OmniVL | 104.6 | No | OmniVL: One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 10 | VinVL | 103.1 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 11 | Enc-Dec | 92.6 | No | Conceptual 12M: Pushing Web-Scale Image-Text Pre... | 2021-02-17 | Code |