Metric: SPICE (higher is better)
| # | Model | SPICE | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 15.1 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 2 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 15.1 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 3 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 14.8 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 4 | BLIP_ViT-L | 14.4 | No | BLIP: Bootstrapping Language-Image Pre-training ... | 2022-01-28 | Code |
| 5 | BLIP_CapFilt-L | 14.2 | No | BLIP: Bootstrapping Language-Image Pre-training ... | 2022-01-28 | Code |
| 6 | OmniVL | 14.2 | No | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 7 | LEMON_large | 14.0 | No | Scaling Up Vision-Language Pre-training for Imag... | 2021-11-24 | - |
| 8 | VinVL | 12.1 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 9 | Enc-Dec | 11.9 | No | Conceptual 12M: Pushing Web-Scale Image-Text Pre... | 2021-02-17 | Code |