Image Captioning on COCO Captions
Metric: BLEU-4 (higher is better)
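The page does not define the metric, so the following is only a minimal sketch of how corpus-level BLEU-4 is commonly computed: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty, then scaled to 0-100. The function names `ngrams` and `corpus_bleu4` are hypothetical, whitespace tokenization is an assumption, and no smoothing is applied. Leaderboard entries are produced with each paper's own evaluation tooling, so this sketch is not expected to reproduce the listed numbers exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(candidates, references_list):
    """Corpus-level BLEU-4 on a 0-100 scale, without smoothing.

    candidates: list of token lists, one generated caption per image.
    references_list: list of lists of token lists (several references per image).
    """
    matches = [0] * 4  # clipped n-gram matches for n = 1..4
    totals = [0] * 4   # candidate n-gram counts for n = 1..4
    cand_len = 0
    ref_len = 0

    for cand, refs in zip(candidates, references_list):
        cand_len += len(cand)
        # Use the reference length closest to the candidate length;
        # ties go to the shorter reference.
        ref_len += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
        for n in range(1, 5):
            cand_counts = ngrams(cand, n)
            # Maximum count of each n-gram over all references (clipping bound).
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matches[n - 1] += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
            totals[n - 1] += sum(cand_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if cand_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    cand = "the cat sat on the mat".split()
    refs = ["the cat sat on a mat".split()]
    print(round(corpus_bleu4([cand], [refs]), 2))
```

Because the score is scaled by 100 here, a value such as 0.382 from a tool that reports BLEU on a 0-1 scale would need to be multiplied by 100 before it can be compared with the other entries.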
Results
| Rank | Model | BLEU-4 | Extra Data | Paper | Date | Code | Badge |
|---|---|---|---|---|---|---|---|
| 1 | mPLUG | 46.5 | No | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | 2022-05-24 | Code | SOTA |
| 2 | OFA | 44.9 | No | OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | 2022-02-07 | Code | SOTA |
| 3 | GIT | 44.1 | No | GIT: A Generative Image-to-text Transformer for Vision and Language | 2022-05-27 | Code | |
| 4 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 43.7 | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Code | |
| 5 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 43.5 | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Code | |
| 6 | ExpansionNet v2 (No VL pretraining) | 42.7 | No | Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning | 2022-08-13 | Code | |
| 7 | LEMON | 42.6 | No | Scaling Up Vision-Language Pre-training for Image Captioning | 2021-11-24 | - | SOTA |
| 8 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 42.4 | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Code | |
| 9 | GRIT (No VL pretraining - base) | 42.4 | No | GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | 2022-07-20 | Code | |
| 10 | Prompt Tuning | 41.81 | No | Prompt Tuning for Generative Multimodal Pretrained Models | 2022-08-04 | Code | |
| 11 | Oscar | 41.7 | No | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | 2020-04-13 | Code | SOTA |
| 12 | Xmodal-Ctx | 41.4 | No | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | 2022-05-09 | Code | |
| 13 | Xmodal-Ctx + OSCAR | 41.3 | No | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | 2022-05-09 | Code | |
| 14 | X-VLM (base) | 41.3 | No | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | 2021-11-16 | Code | |
| 15 | VinVL | 41 | No | VinVL: Revisiting Visual Representations in Vision-Language Models | 2021-01-02 | Code | |
| 16 | CoCa | 40.9 | No | CoCa: Contrastive Captioners are Image-Text Foundation Models | 2022-05-04 | Code | |
| 17 | SimVLM | 40.6 | No | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | 2021-08-24 | Code | |
| 18 | Prismer | 40.4 | No | Prismer: A Vision-Language Model with Multi-Task Experts | 2023-03-04 | Code | |
| 19 | PTP-BLIP (14M) | 40.1 | No | Position-guided Text Prompt for Vision-Language Pre-training | 2022-12-19 | Code | |
| 20 | L-Verse | 39.9 | No | L-Verse: Bidirectional Generation Between Image and Text | 2021-11-22 | Code | |
| 21 | Xmodal-Ctx | 39.7 | No | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | 2022-05-09 | Code | |
| 22 | X-Transformer | 39.7 | No | X-Linear Attention Networks for Image Captioning | 2020-03-31 | Code | SOTA |
| 23 | AoANet + VC | 39.5 | No | Visual Commonsense R-CNN | 2020-02-27 | Code | SOTA |
| 24 | Transformer_NSC | 39.4 | No | A Better Variant of Self-Critical Sequence Training | 2020-03-22 | Code | |
| 25 | Meshed-Memory Transformer | 39.1 | No | Meshed-Memory Transformer for Image Captioning | 2019-12-17 | Code | SOTA |
| 26 | CLIP Text Encoder (RL w/ CIDEr-reward) | 38.2 | No | Fine-grained Image Captioning with CLIP Reward | 2022-05-26 | Code | |
| 27 | RefineCap (w/ REINFORCE) | 37.8 | No | RefineCap: Concept-Aware Refinement for Image Captioning | 2021-09-08 | - | |
| 28 | RDN | 37.3 | No | Reflective Decoding Network for Image Captioning | 2019-08-30 | - | SOTA |
| 29 | SmallCap (d=16, Large) | 37.2 | No | SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation | 2022-09-30 | Code | |
| 30 | ClipCap (Transformer) | 33.53 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code | |
| 31 | ClipCap (MLP + GPT2 tuning) | 32.15 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code | |
| 32 | CapDec | 26.4 | No | Text-Only Training for Image Captioning using Noise-Injected CLIP | 2022-11-01 | Code | |
| 33 | From Captions to Visual Concepts and Back | 25.7 | No | From Captions to Visual Concepts and Back | 2014-11-18 | Code | SOTA |
| 34 | VLKD (ViT-B/16) | 16.7 | No | - | - | - | |
| 35 | LaDiC (ours, 30 steps) | 0.382 | No | LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | 2024-04-16 | Code | |

Note: the LaDiC entry is listed as 0.382, which appears to be on a 0-1 scale rather than the 0-100 scale used by the other entries. If so, it corresponds to 38.2, and its last-place rank reflects the scale mismatch rather than the model's relative quality.