Image Captioning on COCO Captions
Metric: CIDEr (higher is better)
Results
| # | Model | CIDEr | Extra Data | Paper | Date | Code |
|---|-------|-------|------------|-------|------|------|
| 1 | mPLUG | 155.1 | No | mPLUG: Effective and Efficient Vision-Language L... | 2022-05-24 | Yes |
| 2 | OFA | 154.9 | No | OFA: Unifying Architectures, Tasks, and Modaliti... | 2022-02-07 | Yes |
| 3 | VALOR | 152.5 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Yes |
| 4 | GIT | 151.1 | No | GIT: A Generative Image-to-text Transformer for ... | 2022-05-27 | Yes |
| 5 | VAST | 149 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Yes |
| 6 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 145.8 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Yes |
| 7 | LEMON | 145.5 | No | Scaling Up Vision-Language Pre-training for Imag... | 2021-11-24 | - |
| 8 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 145.2 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Yes |
| 9 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 144.5 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Yes |
| 10 | GRIT (No VL pretraining - base) | 144.2 | No | GRIT: Faster and Better Image captioning Transfo... | 2022-07-20 | Yes |
| 11 | ExpansionNet v2 (No VL pretraining) | 143.7 | No | Exploiting Multiple Sequence Lengths in Fast End... | 2022-08-13 | Yes |
| 12 | CoCa | 143.6 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Yes |
| 13 | SimVLM | 143.3 | No | SimVLM: Simple Visual Language Model Pretraining... | 2021-08-24 | Yes |
| 14 | Xmodal-Ctx + OSCAR | 142.2 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Yes |
| 15 | Prompt Tuning | 141.4 | No | Prompt Tuning for Generative Multimodal Pretrain... | 2022-08-04 | Yes |
| 16 | VinVL | 140.9 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Yes |
| 17 | X-VLM (base) | 140.8 | No | Multi-Grained Vision Language Pre-Training: Alig... | 2021-11-16 | Yes |
| 18 | Oscar | 140 | No | Oscar: Object-Semantics Aligned Pre-training for... | 2020-04-13 | Yes |
| 19 | Xmodal-Ctx | 139.9 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Yes |
| 20 | Prismer | 136.5 | No | Prismer: A Vision-Language Model with Multi-Task... | 2023-03-04 | Yes |
| 21 | Xmodal-Ctx | 135.9 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Yes |
| 22 | PTP-BLIP (14M) | 135 | No | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Yes |
| 23 | X-Transformer | 132.8 | No | X-Linear Attention Networks for Image Captioning | 2020-03-31 | Yes |
| 24 | Meshed-Memory Transformer | 131.2 | No | Meshed-Memory Transformer for Image Captioning | 2019-12-17 | Yes |
| 25 | Transformer_NSC | 129.6 | No | A Better Variant of Self-Critical Sequence Train... | 2020-03-22 | Yes |
| 26 | RefineCap (w/ REINFORCE) | 127.2 | No | RefineCap: Concept-Aware Refinement for Image Ca... | 2021-09-08 | - |
| 27 | LaDiC (ours, 30 steps) | 126.2 | No | LaDiC: Are Diffusion Models Really Inferior to A... | 2024-04-16 | Yes |
| 28 | RDN | 125.2 | No | Reflective Decoding Network for Image Captioning | 2019-08-30 | - |
| 29 | CLIP Text Encoder (RL w/ CIDEr-reward) | 124.9 | No | Fine-grained Image Captioning with CLIP Reward | 2022-05-26 | Yes |
| 30 | SmallCap d=16, Large | 121.8 | No | SmallCap: Lightweight Image Captioning Prompted ... | 2022-09-30 | Yes |
| 31 | ClipCap (Transformer) | 113.08 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Yes |
| 32 | ClipCap (MLP + GPT2 tuning) | 108.35 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Yes |
| 33 | Virtex (ResNet-101) | 94 | No | VirTex: Learning Visual Representations from Tex... | 2020-06-11 | Yes |
| 34 | CapDec | 91.8 | No | Text-Only Training for Image Captioning using No... | 2022-11-01 | Yes |
| 35 | KOSMOS-1 (1.6B) (zero-shot) | 84.7 | No | - | - | - |
| 36 | VLKD (ViT-B/16) | 58.3 | No | - | - | - |