Video Captioning on MSVD
Metric: CIDEr (higher is better)
Results
| # | Model | CIDEr | Extra Data | Paper | Date | Code |
|---|-------|-------|------------|-------|------|------|
| 1 | MaMMUT | 195.6 | No | MaMMUT: A Simple Architecture for Joint Learning... | 2023-03-29 | Code |
| 2 | VLAB | 179.8 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 3 | VALOR | 178.5 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 4 | COSA | 178.5 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 5 | mPLUG-2 | 165.8 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 6 | HowToCaption | 154.2 | No | HowToCaption: Prompting LLMs to Transform Video ... | 2023-10-07 | Code |
| 7 | HiTeA | 146.9 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 8 | Vid2Seq | 146.2 | Yes | Vid2Seq: Large-Scale Pretraining of a Visual Lan... | 2023-02-27 | Code |
| 9 | VIOLETv2 | 139.2 | No | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 10 | RTQ | 123.4 | No | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code |
| 11 | CoCap (ViT/L14) | 121.5 | No | Accurate and Fast Compressed Video Captioning | 2023-09-22 | Code |
| 12 | VASTA (Vatex-backbone) | 119.7 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 13 | IcoCap (ViT-B/16) | 110.3 | Yes | - | - | - |
| 14 | SEM-POS | 108.3 | No | SEM-POS: Grammatically and Semantically Correct ... | 2023-03-26 | - |
| 15 | VASTA (Kinetics-backbone) | 106.4 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 16 | IcoCap (ViT-B/32) | 103.8 | Yes | - | - | - |