Metric: CIDEr (higher is better)
| # | Model | CIDEr | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Unified VLP | 67.4 | No | Unified Vision-Language Pre-Training for Image C... | 2019-09-24 | Code |
| 2 | KOSMOS-1 1.6B (zero-shot) | 67.1 | No | - | - | - |
| 3 | Cornia et al. | 46.4 | Yes | Paying More Attention to Saliency: Image Caption... | 2017-06-26 | - |
| 4 | MetaLM | 43.3 | No | Language Models are General-Purpose Interfaces | 2022-06-13 | Code |
| 5 | FewVLM | 31 | No | A Good Prompt Is Worth Millions of Parameters: L... | 2021-10-16 | Code |
| 6 | BRNN | 24.7 | No | Deep Visual-Semantic Alignments for Generating I... | 2014-12-07 | Code |
| 7 | VL-T5 | 2.6 | No | Unifying Vision-and-Language Tasks via Text Gene... | 2021-02-04 | Code |