Metric: SPICE (higher is better)
| # | Model | SPICE | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 16.3 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 2 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 15.8 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 3 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 15.8 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 4 | LEMON_large | 15.8 | No | Scaling Up Vision-Language Pre-training for Imag... | 2021-11-24 | - |
| 5 | BLIP_ViT-L | 15.2 | No | BLIP: Bootstrapping Language-Image Pre-training ... | 2022-01-28 | Code |
| 6 | OmniVL | 15.0 | No | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 7 | BLIP_CapFilt-L | 14.9 | No | BLIP: Bootstrapping Language-Image Pre-training ... | 2022-01-28 | Code |
| 8 | LEMON_base | 14.7 | No | Scaling Up Vision-Language Pre-training for Imag... | 2021-11-24 | - |
| 9 | VinVL | 14.2 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 10 | Enc-Dec | 12.5 | No | Conceptual 12M: Pushing Web-Scale Image-Text Pre... | 2021-02-17 | Code |