Video Captioning on MSR-VTT
Metric: CIDEr (higher is better)
Results
| # | Model | CIDEr | Extra Data | Paper | Date | Code | SOTA |
|---|---|---|---|---|---|---|---|
| 1 | mPLUG-2 | 80 | No | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | 2023-02-01 | Code | SOTA |
| 2 | VAST | 78 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023-05-29 | Code | - |
| 3 | GIT2 | 75.9 | Yes | GIT: A Generative Image-to-text Transformer for Vision and Language | 2022-05-27 | Code | SOTA |
| 4 | VLAB | 74.9 | Yes | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | 2023-05-22 | - | - |
| 5 | COSA | 74.7 | Yes | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | 2023-06-15 | Code | - |
| 6 | VALOR | 74 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023-04-17 | Code | - |
| 7 | MaMMUT (ours) | 73.6 | No | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | 2023-03-29 | Code | - |
| 8 | VideoCoCa | 73.2 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | 2022-12-09 | - | - |
| 9 | RTQ | 69.3 | Yes | RTQ: Rethinking Video-language Understanding Based on Image-text Model | 2023-12-01 | Code | - |
| 10 | HowToCaption | 65.3 | No | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | 2023-10-07 | Code | - |
| 11 | HiTeA | 65.1 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | 2022-12-30 | - | - |
| 12 | Vid2Seq | 64.6 | Yes | Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | 2023-02-27 | Code | - |
| 13 | TextKG | 60.8 | No | Text with Knowledge Graph Augmented Transformer for Video Captioning | 2023-03-22 | - | - |
| 14 | IcoCap (ViT-B/16) | 60.2 | Yes | - | - | - | - |
| 15 | MV-GPT | 60 | Yes | End-to-end Generative Pretraining for Multimodal Video Captioning | 2022-01-20 | - | SOTA |
| 16 | IcoCap (ViT-B/32) | 59.1 | Yes | - | - | - | - |
| 17 | CLIP-DCD | 58.7 | No | CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter | 2021-11-30 | Code | SOTA |
| 18 | VIOLETv2 | 58 | No | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | 2022-09-04 | Code | - |
| 19 | CoCap (ViT/L14) | 57.2 | No | Accurate and Fast Compressed Video Captioning | 2023-09-22 | Code | - |
| 20 | VASTA (Vatex-backbone) | 56.08 | No | Diverse Video Captioning by Adaptive Spatio-temporal Attention | 2022-08-19 | Code | - |
| 21 | VASTA (Kinetics-backbone) | 55 | No | Diverse Video Captioning by Adaptive Spatio-temporal Attention | 2022-08-19 | Code | - |
| 22 | EMCL-Net | 54.6 | No | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | 2022-11-21 | Code | - |
| 23 | SEM-POS | 53.1 | No | SEM-POS: Grammatically and Semantically Correct Video Captioning | 2023-03-26 | - | - |
| 24 | UniVL + MELTR | 52.77 | No | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | 2023-03-23 | Code | - |