Video Captioning on MSR-VTT
Metric: ROUGE-L (higher is better)
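For readers unfamiliar with the metric, ROUGE-L scores a generated caption by the longest common subsequence (LCS) of tokens it shares with a reference caption, combining LCS-based precision and recall into an F-measure. The following is a minimal sketch, not the evaluation script behind the numbers on this page: whitespace tokenization, lowercasing, a single reference per caption, and the default `beta=1.0` are all assumptions, and the function names are hypothetical. Evaluation toolkits may weight recall differently and aggregate over several references, and the leaderboard values appear to be reported on a 0-100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programme that keeps only the previous row of the LCS table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-measure in [0, 1] (assumed setup, see lead-in)."""
    # Assumption: lowercase + whitespace tokenization.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    score = rouge_l("a man is playing a guitar", "a man plays the guitar")
    print(f"ROUGE-L: {100 * score:.1f}")
```

In this example the shared subsequence is "a man guitar" (3 tokens), giving precision 3/6 and recall 3/5 before they are combined.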
Results
| # | Model | ROUGE-L | Extra Data | Paper | Date | Code |
|---|-------|---------|------------|-------|------|------|
| 1 | mPLUG-2 | 70.1 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 2 | VLAB | 68.3 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 3 | GIT2 | 68.2 | Yes | GIT: A Generative Image-to-text Transformer for ... | 2022-05-27 | Code |
| 4 | VALOR | 68 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 5 | VideoCoCa | 68 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 6 | HowToCaption | 66.3 | No | HowToCaption: Prompting LLMs to Transform Video ... | 2023-10-07 | Code |
| 7 | RTQ | 66.1 | Yes | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code |
| 8 | HiTeA | 65 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 9 | IcoCap (ViT-B/16) | 64.9 | Yes | - | - | - |
| 10 | TextKG | 64.8 | No | Text with Knowledge Graph Augmented Transformer ... | 2023-03-22 | - |
| 11 | CLIP-DCD | 64.8 | No | CLIP Meets Video Captioning: Concept-Aware Repre... | 2021-11-30 | Code |
| 12 | IcoCap (ViT-B/32) | 64.3 | Yes | - | - | - |
| 13 | SEM-POS | 64.1 | No | SEM-POS: Grammatically and Semantically Correct ... | 2023-03-26 | - |
| 14 | MV-GPT | 64 | Yes | End-to-end Generative Pretraining for Multimodal... | 2022-01-20 | - |
| 15 | CoCap (ViT/L14) | 63.4 | No | Accurate and Fast Compressed Video Captioning | 2023-09-22 | Code |
| 16 | EMCL-Net | 63.2 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 17 | VASTA (Vatex-backbone) | 62.9 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 18 | VASTA (Kinetics-backbone) | 62.5 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 19 | UniVL + MELTR | 62.35 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |