Video Captioning on MSVD
Metric: CIDEr (higher is better)
Results
| # | Model | CIDEr | Extra Data | SOTA | Paper | Date | Code |
|---|-------|-------|------------|------|-------|------|------|
| 1 | MaMMUT | 195.6 | No | SOTA | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | 2023-03-29 | Code |
| 2 | VLAB | 179.8 | Yes | - | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | 2023-05-22 | - |
| 3 | VALOR | 178.5 | Yes | - | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023-04-17 | Code |
| 4 | COSA | 178.5 | Yes | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | 2023-06-15 | Code |
| 5 | mPLUG-2 | 165.8 | No | SOTA | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | 2023-02-01 | Code |
| 6 | HowToCaption | 154.2 | No | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | 2023-10-07 | Code |
| 7 | HiTeA | 146.9 | Yes | SOTA | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | 2022-12-30 | - |
| 8 | Vid2Seq | 146.2 | Yes | - | Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | 2023-02-27 | Code |
| 9 | VIOLETv2 | 139.2 | No | SOTA | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | 2022-09-04 | Code |
| 10 | RTQ | 123.4 | No | - | RTQ: Rethinking Video-language Understanding Based on Image-text Model | 2023-12-01 | Code |
| 11 | CoCap (ViT/L14) | 121.5 | No | - | Accurate and Fast Compressed Video Captioning | 2023-09-22 | Code |
| 12 | VASTA (Vatex-backbone) | 119.7 | No | SOTA | Diverse Video Captioning by Adaptive Spatio-temporal Attention | 2022-08-19 | Code |
| 13 | IcoCap (ViT-B/16) | 110.3 | Yes | - | No paper | - | - |
| 14 | SEM-POS | 108.3 | No | - | SEM-POS: Grammatically and Semantically Correct Video Captioning | 2023-03-26 | - |
| 15 | VASTA (Kinetics-backbone) | 106.4 | No | - | Diverse Video Captioning by Adaptive Spatio-temporal Attention | 2022-08-19 | Code |
| 16 | IcoCap (ViT-B/32) | 103.8 | Yes | - | No paper | - | - |