Zero-Shot Video Retrieval on MSR-VTT

Metric: text-to-video R@1 (higher is better)
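
As a rough illustration of the metric, the sketch below computes text-to-video R@K from a caption-by-video similarity matrix. It is not any listed paper's evaluation code. The function name `recall_at_k`, the matrix layout (row i is caption i, and its correct video is column i), and the toy scores are all assumptions made for this example.

```python
from typing import Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int = 1) -> float:
    """Percentage of text queries whose matching video ranks in the top k.

    Assumes sim[i][j] is the similarity of text query i to video j and that
    the correct video for query i is video i (a hypothetical layout).
    """
    if not sim:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(sim):
        target = row[i]
        # Zero-based rank: count of videos scored strictly higher than the
        # correct one. Ties are resolved in favour of the correct video.
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy example: 3 captions x 3 videos with made-up scores.
toy_sim = [
    [0.9, 0.1, 0.3],  # caption 0: correct video ranked first
    [0.2, 0.4, 0.8],  # caption 1: correct video ranked second
    [0.1, 0.2, 0.7],  # caption 2: correct video ranked first
]
r1 = recall_at_k(toy_sim, k=1)  # 2 of 3 captions hit at rank 1
r2 = recall_at_k(toy_sim, k=2)  # all 3 captions hit within the top 2
```

Read this way, an R@1 of 55.9 would mean the correct video was the top-ranked result for 55.9% of the text queries.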

Results

#	Model	text-to-video R@1	Extra Data	SOTA	Paper	Date	Code
1	InternVideo2-6B	55.9	Yes	Yes	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024-03-22	Code
2	GRAM	54.8	Yes	No	Gramian Multimodal Representation Learning and Alignment	2024-12-16	Code
3	InternVideo2-1B	51.9	Yes	No	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024-03-22	Code
4	VAST, HowToCaption-finetuned	50	No	Yes	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	2023-10-07	Code
5	FluxViT-B	49.9	Yes	No	Make Your Training Flexible: Towards Deployment-Efficient Video Models	2025-03-18	Code
6	VAST	49.3	Yes	Yes	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	2023-05-29	Code
7	mPLUG-2	47.1	No	Yes	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023-02-01	Code
8	FluxViT-S	45	Yes	No	Make Your Training Flexible: Towards Deployment-Efficient Video Models	2025-03-18	Code
9	LanguageBind(ViT-H/14)	44.8	Yes	No	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	2023-10-03	Code
10	LanguageBind(ViT-L/14)	42.8	Yes	No	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	2023-10-03	Code
11	UMT-L (ViT-L/16)	42.6	Yes	No	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	2023-03-28	Code
12	vid-TLDR (UMT-L)	42.1	Yes	No	vid-TLDR: Training Free Token merging for Light-weight Video Transformer	2024-03-20	Code
13	BT-Adapter	40.9	No	No	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023-09-27	Code
14	InternVideo	40.7	Yes	Yes	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022-12-06	Code
15	Florence	37.6	No	Yes	Florence: A New Foundation Model for Computer Vision	2021-11-22	Code
16	HowToCaption	37.6	No	No	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	2023-10-07	Code
17	ImageBind	36.8	No	No	ImageBind: One Embedding Space To Bind Them All	2023-05-09	Code
18	OmniVL	34.6	Yes	No	OmniVL:One Foundation Model for Image-Language and Video-Language Tasks	2022-09-15	-
19	HiTeA-17M	34.4	No	No	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022-12-30	-
20	Singularity-17M	34	Yes	No	Revealing Single Frame Bias for Video-and-Language Learning	2022-06-07	Code
21	CLIP4Clip	32	No	Yes	CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	2021-04-18	Code
22	Yatai Ji et al.	30.9	No	No	Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning	2022-11-24	Code
23	HiTeA-5M	29.9	No	No	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022-12-30	-
24	Singularity-5M	28.4	Yes	No	Revealing Single Frame Bias for Video-and-Language Learning	2022-06-07	Code
25	Clover	26.4	No	No	Clover: Towards A Unified Video-Language Alignment and Fusion Model	2022-07-16	Code
26	MILES	26.1	No	No	MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval	2022-04-26	Code
27	Y. Ge et al.	26	No	No	Bridging Video-text Retrieval with Multiple Choice Questions	2022-01-13	Code
28	VIOLET	25.9	No	No	VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling	2021-11-24	Code
29	FROZEN	24.7	No	Yes	Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	2021-04-01	Code
30	ALPRO	24.1	No	No	Align and Prompt: Video-and-Language Pre-training with Entity Prompts	2021-12-17	Code
31	OA-Trans	23.4	No	No	Object-aware Video-language Pre-training for Retrieval	2021-12-01	Code
32	LaT	23.4	No	No	LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval	2022-07-11	-
33	A. Nagrani et al.	19.4	Yes	No	Learning Audio-Video Modalities from Image Captions	2022-04-01	-
34	HD-VILA	14.6	No	No	Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions	2021-11-19	Code
35	Norton	10.7	No	No	Multi-granularity Correspondence Learning from Long-term Noisy Videos	2024-01-30	Code
36	VideoCLIP	10.4	No	No	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	2021-09-28	Code
37	MIL-NCE	9.9	No	Yes	End-to-End Learning of Visual Representations from Uncurated Instructional Videos	2019-12-13	Code
38	TACo	9.8	No	No	TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment	2021-08-23	-
39	SSML	8	No	No	Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning	2020-03-06	Code