Metric: CIDEr (higher is better)
| # | Model | CIDEr | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | PaLI | 149.1 | No | PaLI: A Jointly-Scaled Multilingual Language-Ima... | 2022-09-14 | Code |
| 2 | GIT2, Single Model | 124.18 | No | GIT: A Generative Image-to-text Transformer for ... | 2022-05-27 | Code |
| 3 | GIT, Single Model | 122.4 | No | GIT: A Generative Image-to-text Transformer for ... | 2022-05-27 | Code |
| 4 | PaLI | 121.09 | No | PaLI: A Jointly-Scaled Multilingual Language-Ima... | 2022-09-14 | Code |
| 5 | CoCa - Google Brain | 117.9 | No | - | - | - |
| 6 | Microsoft Cognitive Services team | 112.82 | No | VIVO: Visual Vocabulary Pre-Training for Novel O... | 2020-09-28 | - |
| 7 | Single Model | 108.98 | No | SimVLM: Simple Visual Language Model Pretraining... | 2021-08-24 | Code |
| 8 | GRIT (zero-shot, no VL pretraining, no CBS) | 105.9 | No | GRIT: Faster and Better Image captioning Transfo... | 2022-07-20 | Code |
| 9 | FudanFVL | 104.9 | No | - | - | - |
| 10 | FudanWYZ | 104.25 | No | - | - | - |
| 11 | IEDA-LAB | 102.64 | No | - | - | - |
| 12 | vll@mk514 | 101.69 | No | - | - | - |
| 13 | MD | 100.03 | No | - | - | - |
| 14 | firethehole | 99.9 | No | - | - | - |
| 15 | VinVL (Microsoft Cognitive Services + MSR) | 97.99 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 16 | ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS | 96.63 | No | - | - | - |
| 17 | camel XE | 88.08 | No | - | - | - |
| 18 | evertyhing | 87.86 | No | - | - | - |
| 19 | RCAL | 87.28 | No | - | - | - |
| 20 | icgp2ssi1_coco_si_0.02_5_test | 87.21 | No | - | - | - |
| 21 | cxy_nocaps_training | 85.81 | No | - | - | - |
| 22 | 作者给的test文件 (test file provided by the author) | 85.81 | No | - | - | - |
| 23 | ClipCap (Transformer) | 84.85 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code |
| 24 | Oscar | 84.83 | No | - | - | - |
| 25 | Xinyi | 84.79 | No | - | - | - |
| 26 | Human | 80.61 | No | - | - | - |
| 27 | MQ-UpDown-C | 80.19 | No | - | - | - |
| 28 | ClipCap (MLP + GPT2 tuning) | 79.73 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code |
| 29 | UpDown + ELMo + CBS | 76.02 | No | - | - | - |
| 30 | UpDown | 74.27 | No | - | - | - |
| 31 | nocaps_training | 74.27 | No | - | - | - |
| 32 | 7_10-7_40000_predict_test.json | 73.73 | No | - | - | - |
| 33 | None | 70.33 | No | - | - | - |
| 34 | YX | 69.59 | No | - | - | - |
| 35 | B2 | 68.98 | No | - | - | - |
| 36 | area_attention | 67.91 | No | - | - | - |
| 37 | coco_all_19 | 64.37 | No | - | - | - |
| 38 | Neural Baby Talk + CBS | 62.96 | No | - | - | - |
| 39 | Neural Baby Talk | 60.89 | No | - | - | - |
| 40 | CS395T | 58.93 | No | - | - | - |
| 41 | Yu-Wu | 53.34 | No | - | - | - |