Video Captioning on MSR-VTT

Metric: METEOR (higher is better)

Results

#	Model	METEOR	Extra Data	Paper	Date	Code
1	MV-GPT (SOTA)	38.7	Yes	End-to-end Generative Pretraining for Multimodal Video Captioning	2022-01-20	-
2	mPLUG-2	34.9	No	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023-02-01	Code
3	VLAB	33.4	Yes	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	2023-05-22	-
4	GIT2	33.1	Yes	GIT: A Generative Image-to-text Transformer for Vision and Language	2022-05-27	Code
5	VALOR	32.9	Yes	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	2023-04-17	Code
6	HowToCaption	32.2	No	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	2023-10-07	Code
7	CLIP-DCD (SOTA)	31.3	No	CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter	2021-11-30	Code
8	IcoCap (ViT-B/16)	31.1	Yes	-	-	-
9	Vid2Seq	30.8	Yes	Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning	2023-02-27	Code
10	HiTeA	30.7	Yes	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022-12-30	-
11	SEM-POS	30.7	No	SEM-POS: Grammatically and Semantically Correct Video Captioning	2023-03-26	-
12	TextKG	30.5	No	Text with Knowledge Graph Augmented Transformer for Video Captioning	2023-03-22	-
13	IcoCap (ViT-B/32)	30.3	Yes	-	-	-
14	CoCap (ViT/L14)	30.3	No	Accurate and Fast Compressed Video Captioning	2023-09-22	Code
15	VASTA (Vatex-backbone)	30.24	No	Diverse Video Captioning by Adaptive Spatio-temporal Attention	2022-08-19	Code
16	VASTA (Kinetics-backbone)	30.2	No	Diverse Video Captioning by Adaptive Spatio-temporal Attention	2022-08-19	Code
17	EMCL-Net	30.2	No	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	2022-11-21	Code
18	UniVL + MELTR	29.26	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code