Image Captioning on COCO Captions
Metric: ROUGE-L (higher is better)
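For readers unfamiliar with the metric, the following is a minimal sketch of a sentence-level ROUGE-L score based on the longest common subsequence (LCS). It is illustrative only and is not the evaluation code behind this leaderboard. The assumptions are: whitespace tokenization with lowercasing, the best precision and best recall taken over all reference captions, and a recall weight of beta = 1.2, which is the convention commonly used in COCO caption evaluation toolkits. The function names `lcs_length` and `rouge_l` are hypothetical. Leaderboard values appear to be this kind of score multiplied by 100.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, references, beta=1.2):
    """Sentence-level ROUGE-L F-score in [0, 1] (sketch, see assumptions above).

    candidate:  generated caption (string)
    references: list of reference captions (strings)
    beta:       recall weight; 1.2 is an assumed, commonly used value
    """
    cand = candidate.lower().split()  # assumption: simple whitespace tokens
    if not cand:
        return 0.0
    best_p = 0.0
    best_r = 0.0
    for ref in references:
        ref_tokens = ref.lower().split()
        if not ref_tokens:
            continue
        lcs = lcs_length(cand, ref_tokens)
        best_p = max(best_p, lcs / len(cand))        # LCS precision
        best_r = max(best_r, lcs / len(ref_tokens))  # LCS recall
    if best_p == 0.0 or best_r == 0.0:
        return 0.0
    return (1 + beta ** 2) * best_p * best_r / (best_r + beta ** 2 * best_p)


if __name__ == "__main__":
    refs = ["a man rides a horse on the beach", "a person riding a horse"]
    print(round(100 * rouge_l("a man riding a horse", refs), 1))
```

A real evaluation would normally use the official tokenizer and scoring scripts so that numbers are comparable with published results.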
Results
| # | Model | ROUGE-L | Extra Data | Paper | Date | Code | SOTA |
|---|-------|---------|------------|-------|------|------|------|
| 1 | ExpansionNet v2 (No VL pretraining) | 61.1 | No | Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning | 2022-08-13 | Yes | Yes |
| 2 | GRIT (No VL pretraining - base) | 60.7 | No | GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | 2022-07-20 | Yes | Yes |
| 3 | Xmodal-Ctx | 60.4 | No | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | 2022-05-09 | Yes | - |
| 4 | L-Verse | 60.4 | No | L-Verse: Bidirectional Generation Between Image and Text | 2021-11-22 | Yes | Yes |
| 5 | Xmodal-Ctx | 59.5 | No | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | 2022-05-09 | Yes | - |
| 6 | AoANet + VC | 59.3 | No | Visual Commonsense R-CNN | 2020-02-27 | Yes | Yes |
| 7 | X-Transformer | 59.1 | No | X-Linear Attention Networks for Image Captioning | 2020-03-31 | Yes | - |
| 8 | Transformer_NSC | 58.7 | No | A Better Variant of Self-Critical Sequence Training | 2020-03-22 | Yes | - |
| 9 | LaDiC | 58.7 | No | LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | 2024-04-16 | Yes | - |
| 10 | Meshed-Memory Transformer | 58.6 | No | Meshed-Memory Transformer for Image Captioning | 2019-12-17 | Yes | Yes |
| 11 | CLIP Text Encoder (RL w/ CIDEr-reward) | 58.5 | No | Fine-grained Image Captioning with CLIP Reward | 2022-05-26 | Yes | - |
| 12 | RefineCap (w/ REINFORCE) | 58 | No | RefineCap: Concept-Aware Refinement for Image Captioning | 2021-09-08 | - | - |
| 13 | RDN | 57.4 | No | Reflective Decoding Network for Image Captioning | 2019-08-30 | - | Yes |