Image Captioning on COCO (Common Objects in Context)

Metric: CIDEr (higher is better)

Results

#	Model	CIDEr	Extra Data	Paper	Date	Code
1	ExpansionNet v2	143.7	No	Exploiting Multiple Sequence Lengths in Fast End...	2022-08-13	Code
2	M2 Transformer	131.2	No	Meshed-Memory Transformer for Image Captioning	2019-12-17	Code
3	IGINet	131	No	-	-	-
4	UNIMO-large	127.7	No	UNIMO: Towards Unified-Modal Understanding and G...	2020-12-31	Code
5	RDN	125.2	No	Reflective Decoding Network for Image Captioning	2019-08-30	-
6	Lyrics	121.1	No	Lyrics: Boosting Fine-grained Language-Vision Al...	2023-12-08	-
7	Bit Diffusion (20 steps)	115	No	Analog Bits: Generating Discrete Data using Diff...	2022-08-08	Code
8	Flamingo (80B; 4-shot)	103	No	Retrieval-Augmented Multimodal Language Modeling	2022-11-22	-
9	RA-CM3 (2.7B)	89.1	No	Retrieval-Augmented Multimodal Language Modeling	2022-11-22	-
10	Flamingo (3B; 4-shot)	85	No	Retrieval-Augmented Multimodal Language Modeling	2022-11-22	-
11	Perturb, Predict & Paraphrase	84.5	No	-	-	Code
12	Parti	83.9	No	Retrieval-Augmented Multimodal Language Modeling	2022-11-22	-
13	NIC (ResNet-50, CutMix)	77.6	No	CutMix: Regularization Strategy to Train Strong ...	2019-05-13	Code
14	Vanilla CM3	71.9	No	Retrieval-Augmented Multimodal Language Modeling	2022-11-22	-
15	X-LXMERT	55.8	No	Retrieval-Augmented Multimodal Language Modeling	2022-11-22	-
16	minDALL-E	48	No	Retrieval-Augmented Multimodal Language Modeling	2022-11-22	-
17	ruDALL-E-XL	38.7	No	Retrieval-Augmented Multimodal Language Modeling	2022-11-22	-
18	DALL-E	20.2	No	Retrieval-Augmented Multimodal Language Modeling	2022-11-22	-
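
Because higher CIDEr is better, the ranks in the table correspond to sorting the scores in descending order. Below is a minimal sketch of that ordering, using five rows copied from the table. The function name `rank_by_cider` and the two-column tab-separated layout are assumptions made for illustration, not part of the leaderboard itself.

```python
import csv
import io

# Five rows copied from the table above, deliberately out of order.
# Layout (an assumption for this sketch): model name, tab, CIDEr score.
ROWS = (
    "M2 Transformer\t131.2\n"
    "ExpansionNet v2\t143.7\n"
    "DALL-E\t20.2\n"
    "IGINet\t131\n"
    "UNIMO-large\t127.7\n"
)


def rank_by_cider(tsv_text):
    """Return (rank, model, score) tuples, highest CIDEr first."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    scored = [(model, float(score)) for model, score in reader]
    # Higher is better, so sort in descending order of score.
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(i + 1, model, score) for i, (model, score) in enumerate(scored)]


ranking = rank_by_cider(ROWS)
for rank, model, score in ranking:
    print(f"{rank}\t{model}\t{score}")
```

The ranks printed here are relative to the five sampled rows only, so they differ from the full-table ranks for entries further down the list.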

#1 ExpansionNet v2 (SOTA): 143.7 CIDEr, 2022-08-13. Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning. Code available.
#2 M2 Transformer (SOTA): 131.2 CIDEr, 2019-12-17. Meshed-Memory Transformer for Image Captioning. Code available.
#3 IGINet: 131 CIDEr. No paper.
#4 UNIMO-large: 127.7 CIDEr, 2020-12-31. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning. Code available.
#5 RDN (SOTA): 125.2 CIDEr, 2019-08-30. Reflective Decoding Network for Image Captioning.
#6 Lyrics: 121.1 CIDEr, 2023-12-08. Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects.
#7 Bit Diffusion (20 steps): 115 CIDEr, 2022-08-08. Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning. Code available.
#8 Flamingo (80B; 4-shot): 103 CIDEr, 2022-11-22. Retrieval-Augmented Multimodal Language Modeling.
#9 RA-CM3 (2.7B): 89.1 CIDEr, 2022-11-22. Retrieval-Augmented Multimodal Language Modeling.
#10 Flamingo (3B; 4-shot): 85 CIDEr, 2022-11-22. Retrieval-Augmented Multimodal Language Modeling.
#11 Perturb, Predict & Paraphrase: 84.5 CIDEr. No paper. Code available.
#12 Parti: 83.9 CIDEr, 2022-11-22. Retrieval-Augmented Multimodal Language Modeling.
#13 NIC (ResNet-50, CutMix) (SOTA): 77.6 CIDEr, 2019-05-13. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. Code available.
#14 Vanilla CM3: 71.9 CIDEr, 2022-11-22. Retrieval-Augmented Multimodal Language Modeling.
#15 X-LXMERT: 55.8 CIDEr, 2022-11-22. Retrieval-Augmented Multimodal Language Modeling.
#16 minDALL-E: 48 CIDEr, 2022-11-22. Retrieval-Augmented Multimodal Language Modeling.
#17 ruDALL-E-XL: 38.7 CIDEr, 2022-11-22. Retrieval-Augmented Multimodal Language Modeling.
#18 DALL-E: 20.2 CIDEr, 2022-11-22. Retrieval-Augmented Multimodal Language Modeling.