Video Captioning on MSR-VTT

Metric: ROUGE-L (higher is better)
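
For readers unfamiliar with the metric, here is a minimal sketch of a sentence-level ROUGE-L score: an F-measure built from the longest common subsequence (LCS) between a candidate caption and a reference caption. The function names `lcs_length` and `rouge_l`, the whitespace tokenization, the single-reference setup, and the default `beta=1.0` are all assumptions made for illustration. The evaluation scripts behind the scores listed here may tokenize differently, weight recall differently, or aggregate over several references per video. The function returns a value in [0, 1], whereas the leaderboard scores appear to be on a 0-100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # carry the best so far
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure between two captions (illustrative, single reference)."""
    cand = candidate.lower().split()  # assumption: plain whitespace tokenization
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    # F_beta combination of LCS precision and recall
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


score = rouge_l("a man is playing a guitar", "a man plays a guitar")
print(f"{100 * score:.1f}")
```

In this example the LCS is "a man a guitar" (4 tokens), so precision is 4/6 and recall is 4/5.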


Results


#	Model	ROUGE-L	Extra Data	Paper	Date	Code
1	mPLUG-2	70.1	No	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023-02-01	Code
2	VLAB	68.3	Yes	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	2023-05-22	-
3	GIT2	68.2	Yes	GIT: A Generative Image-to-text Transformer for Vision and Language	2022-05-27	Code
4	VALOR	68	Yes	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	2023-04-17	Code
5	VideoCoCa	68	Yes	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	2022-12-09	-
6	HowToCaption	66.3	No	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	2023-10-07	Code
7	RTQ	66.1	Yes	RTQ: Rethinking Video-language Understanding Based on Image-text Model	2023-12-01	Code
8	HiTeA	65	Yes	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022-12-30	-
9	IcoCap (ViT-B/16)	64.9	Yes	-	-	-
10	TextKG	64.8	No	Text with Knowledge Graph Augmented Transformer for Video Captioning	2023-03-22	-
11	CLIP-DCD	64.8	No	CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter	2021-11-30	Code
12	IcoCap (ViT-B/32)	64.3	Yes	-	-	-
13	SEM-POS	64.1	No	SEM-POS: Grammatically and Semantically Correct Video Captioning	2023-03-26	-
14	MV-GPT	64	Yes	End-to-end Generative Pretraining for Multimodal Video Captioning	2022-01-20	-
15	CoCap (ViT/L14)	63.4	No	Accurate and Fast Compressed Video Captioning	2023-09-22	Code
16	EMCL-Net	63.2	No	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	2022-11-21	Code
17	VASTA (Vatex-backbone)	62.9	No	Diverse Video Captioning by Adaptive Spatio-temporal Attention	2022-08-19	Code
18	VASTA (Kinetics-backbone)	62.5	No	Diverse Video Captioning by Adaptive Spatio-temporal Attention	2022-08-19	Code
19	UniVL + MELTR	62.35	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code

Entries flagged SOTA: mPLUG-2 (#1), GIT2 (#3), CLIP-DCD (#11).