Image Captioning on COCO Captions
Metric: SPICE (higher is better)
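As a rough illustration of why a higher SPICE is better: SPICE parses the candidate caption and the reference captions into scene-graph tuples (objects, attributes, relations) and reports the F-score of their overlap. The sketch below is not the official SPICE implementation. It assumes the tuples have already been extracted and uses exact matching only, whereas real SPICE relies on a dependency parser and WordNet synonym matching. The function name `tuple_f1` and the example tuples are hypothetical, and the leaderboard values appear to be this F-score scaled by 100.

```python
def tuple_f1(candidate: set, reference: set) -> float:
    """F1 over exactly matching tuples (simplified; no synonym matching)."""
    if not candidate or not reference:
        return 0.0
    matched = len(candidate & reference)
    if matched == 0:
        return 0.0
    precision = matched / len(candidate)   # share of candidate tuples that are correct
    recall = matched / len(reference)      # share of reference tuples that are covered
    return 2 * precision * recall / (precision + recall)


# Hypothetical tuples for "a girl sitting at a table" versus a richer reference.
candidate = {("girl",), ("table",), ("girl", "sit-at", "table")}
reference = {
    ("girl",),
    ("girl", "young"),
    ("table",),
    ("table", "wooden"),
    ("girl", "sit-at", "table"),
}

score = tuple_f1(candidate, reference)
print(round(100 * score, 1))  # precision 1.0, recall 0.6 -> 75.0
```

In this toy case every candidate tuple is correct, but two reference tuples are missed, so recall pulls the score down.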
Results
| # | Model | SPICE | Extra Data | Paper | Date | Code |
|---|-------|-------|------------|-------|------|------|
| 1 | VAST | 27 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023-05-29 | Code |
| 2 | OFA | 26.6 | No | OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | 2022-02-07 | Code |
| 3 | GIT | 26.3 | No | GIT: A Generative Image-to-text Transformer for Vision and Language | 2022-05-27 | Code |
| 4 | mPLUG | 26 | No | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | 2022-05-24 | Code |
| 5 | VALOR | 25.7 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023-04-17 | Code |
| 6 | LEMON | 25.5 | No | Scaling Up Vision-Language Pre-training for Image Captioning | 2021-11-24 | - |
| 7 | SimVLM | 25.4 | No | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | 2021-08-24 | Code |
| 8 | VinVL | 25.2 | No | VinVL: Revisiting Visual Representations in Vision-Language Models | 2021-01-02 | Code |
| 9 | Xmodal-Ctx + OSCAR | 24.9 | No | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | 2022-05-09 | Code |
| 10 | ExpansionNet v2 (No VL pretraining) | 24.7 | No | Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning | 2022-08-13 | Code |
| 11 | CoCa | 24.7 | No | CoCa: Contrastive Captioners are Image-Text Foundation Models | 2022-05-04 | Code |
| 12 | Oscar | 24.5 | No | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | 2020-04-13 | Code |
| 13 | Prompt Tuning | 24.42 | No | Prompt Tuning for Generative Multimodal Pretrained Models | 2022-08-04 | Code |
| 14 | Prismer | 24.4 | No | Prismer: A Vision-Language Model with Multi-Task Experts | 2023-03-04 | Code |
| 15 | GRIT (No VL pretraining - base) | 24.3 | No | GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | 2022-07-20 | Code |
| 16 | Xmodal-Ctx | 24 | No | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | 2022-05-09 | Code |
| 17 | PTP-BLIP (14M) | 23.7 | No | Position-guided Text Prompt for Vision-Language Pre-training | 2022-12-19 | Code |
| 18 | Xmodal-Ctx | 23.7 | No | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | 2022-05-09 | Code |
| 19 | X-Transformer | 23.4 | No | X-Linear Attention Networks for Image Captioning | 2020-03-31 | Code |
| 20 | L-Verse | 23.3 | No | L-Verse: Bidirectional Generation Between Image and Text | 2021-11-22 | Code |
| 21 | Transformer_NSC | 22.8 | No | A Better Variant of Self-Critical Sequence Training | 2020-03-22 | Code |
| 22 | Meshed-Memory Transformer | 22.6 | No | Meshed-Memory Transformer for Image Captioning | 2019-12-17 | Code |
| 23 | RefineCap (w/ REINFORCE) | 22.5 | No | RefineCap: Concept-Aware Refinement for Image Captioning | 2021-09-08 | - |
| 24 | LaDiC | 22.4 | No | LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | 2024-04-16 | Code |
| 25 | SmallCap (d=16, Large) | 21.5 | No | SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation | 2022-09-30 | Code |
| 26 | ClipCap (Transformer) | 21.05 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code |
| 27 | ClipCap (MLP + GPT2 tuning) | 20.12 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code |
| 28 | Virtex (ResNet-101) | 18.5 | No | VirTex: Learning Visual Representations from Textual Annotations | 2020-06-11 | Code |
| 29 | KOSMOS-1 (1.6B) (zero-shot) | 16.8 | No | - | - | - |
| 30 | VLKD (ViT-B/16) | 13.4 | No | - | - | - |