Video Captioning on MSVD

Metric: BLEU-4 (higher is better)
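
For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU-4 on a 0-100 scale: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. The function names `ngram_counts` and `corpus_bleu4` are illustrative, and the sketch assumes pre-tokenized captions and applies no smoothing. Leaderboard submissions may use different tokenization and evaluation tooling, so this is not expected to reproduce the reported scores exactly.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(hypotheses, references_list):
    """Corpus-level BLEU-4 on a 0-100 scale (no smoothing).

    hypotheses: list of token lists, one per video.
    references_list: list of lists of token lists (several references per video).
    """
    matched = [0] * 4   # clipped n-gram matches for n = 1..4
    total = [0] * 4     # hypothesis n-gram totals for n = 1..4
    hyp_len = 0
    ref_len = 0
    for hyp, refs in zip(hypotheses, references_list):
        hyp_len += len(hyp)
        # Use the reference length closest to the hypothesis (ties: shorter).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, 5):
            hyp_counts = ngram_counts(hyp, n)
            # Clip each n-gram by its maximum count in any single reference.
            max_ref = Counter()
            for r in refs:
                for gram, count in ngram_counts(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / 4
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    hyp = "a man is playing a guitar".split()
    refs = ["a man is playing a guitar".split(), "someone plays guitar".split()]
    print(corpus_bleu4([hyp], [refs]))
```

A hypothesis that matches its reference but is shorter is penalized only through the brevity term, which is why truncated captions do not score well despite perfect n-gram precision.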

Results

#	Model	BLEU-4	Extra Data	Paper	Date	Code
1	VALOR	80.7	Yes	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	2023-04-17	Code
2	VLAB	79.3	Yes	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	2023-05-22	-
3	COSA	76.5	Yes	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	2023-06-15	Code
4	HiTeA	71	Yes	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022-12-30	-
5	mPLUG-2	70.5	No	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023-02-01	Code
6	HowToCaption	70.4	No	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	2023-10-07	Code
7	RTQ	66.9	No	RTQ: Rethinking Video-language Understanding Based on Image-text Model	2023-12-01	Code
8	CoCap (ViT/L14)	60.1	No	Accurate and Fast Compressed Video Captioning	2023-09-22	Code
9	SEM-POS	60.1	No	SEM-POS: Grammatically and Semantically Correct Video Captioning	2023-03-26	-
10	VASTA (Vatex-backbone)	59.2	No	Diverse Video Captioning by Adaptive Spatio-temporal Attention	2022-08-19	Code
11	IcoCap (ViT-B/16)	59.1	Yes	-	-	-
12	IcoCap (ViT-B/32)	56.3	Yes	-	-	-
13	VASTA (Kinetics-backbone)	56.1	No	Diverse Video Captioning by Adaptive Spatio-temporal Attention	2022-08-19	Code

Entries marked SOTA on the leaderboard: VALOR, HiTeA, VASTA (Vatex-backbone).
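
The three entries marked SOTA appear to be those that held the best BLEU-4 among the listed models at their publication date. That reading is an assumption, not something the page states. The sketch below checks it against the dated rows above; `ROWS` and `running_best` are illustrative names, and the two IcoCap entries are left out because they have no date.

```python
from datetime import date

# (model, BLEU-4, paper date), copied from the dated rows of the leaderboard.
ROWS = [
    ("VALOR", 80.7, date(2023, 4, 17)),
    ("VLAB", 79.3, date(2023, 5, 22)),
    ("COSA", 76.5, date(2023, 6, 15)),
    ("HiTeA", 71.0, date(2022, 12, 30)),
    ("mPLUG-2", 70.5, date(2023, 2, 1)),
    ("HowToCaption", 70.4, date(2023, 10, 7)),
    ("RTQ", 66.9, date(2023, 12, 1)),
    ("CoCap (ViT/L14)", 60.1, date(2023, 9, 22)),
    ("SEM-POS", 60.1, date(2023, 3, 26)),
    ("VASTA (Vatex-backbone)", 59.2, date(2022, 8, 19)),
    ("VASTA (Kinetics-backbone)", 56.1, date(2022, 8, 19)),
]


def running_best(rows):
    """Return the models that set a new best score, in chronological order."""
    best = float("-inf")
    leaders = []
    # Sort by date; on the same date, consider the higher score first.
    for model, score, _when in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best:
            best = score
            leaders.append(model)
    return leaders


if __name__ == "__main__":
    print(running_best(ROWS))
    # → ['VASTA (Vatex-backbone)', 'HiTeA', 'VALOR']
```

Under this assumption the result matches the three SOTA markers on the page.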