Zero-Shot Video Retrieval on DiDeMo

Metric: text-to-video R@1 (higher is better)
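
For readers unfamiliar with the metric, text-to-video R@1 is the percentage of text queries whose correct video is ranked first among all candidate videos. The following is a minimal sketch of that calculation, not the evaluation code of any listed paper. The function name `recall_at_k`, the toy `scores` matrix, and the convention that query i's ground-truth video sits at index i are assumptions made for illustration; the leaderboard itself does not specify an evaluation protocol.

```python
def recall_at_k(similarity, k=1):
    """Text-to-video Recall@K, returned as a percentage.

    similarity[i][j] is the score between text query i and video j.
    Assumption for this sketch: the ground-truth video for query i
    is the video at index i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the correct video = number of videos scoring strictly
        # higher (ties are resolved in favour of the correct video).
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Hypothetical toy scores: 3 text queries against 3 videos.
scores = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.2, 0.4, 0.8],  # query 1: video 2 outranks the correct video
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
print(recall_at_k(scores, k=1))
print(recall_at_k(scores, k=2))
```

In this toy case two of the three queries retrieve the correct video at rank 1, so R@1 is two thirds expressed as a percentage, while R@2 covers all three queries.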

Results

#	Model	text-to-video R@1	Extra Data	SOTA	Paper	Date	Code
1	InternVideo2-6B	57.9	Yes	SOTA	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024-03-22	Code
2	InternVideo2-1B	57.0	Yes	-	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024-03-22	Code
3	VAST	55.5	Yes	SOTA	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	2023-05-29	Code
4	GRAM	54.2	Yes	-	Gramian Multimodal Representation Learning and Alignment	2024-12-16	Code
5	vid-TLDR (UMT-L)	52.0	Yes	-	vid-TLDR: Training Free Token merging for Light-weight Video Transformer	2024-03-20	Code
6	UMT-L (ViT-L/16)	48.6	Yes	SOTA	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	2023-03-28	Code
7	mPLUG-2	45.7	No	SOTA	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023-02-01	Code
8	HiTeA-17M	43.2	Yes	SOTA	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022-12-30	-
9	LanguageBind (ViT-H/14)	39.9	Yes	-	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	2023-10-03	Code
10	LanguageBind (ViT-L/14)	39.7	Yes	-	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	2023-10-03	Code
11	Singularity-17M	37.1	Yes	SOTA	Revealing Single Frame Bias for Video-and-Language Learning	2022-06-07	Code
12	Singularity-5M	36.9	Yes	-	Revealing Single Frame Bias for Video-and-Language Learning	2022-06-07	Code
13	HiTeA-5M	36.1	Yes	-	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022-12-30	-
14	BT-Adapter	35.6	No	-	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023-09-27	Code
15	OmniVL	33.3	Yes	-	OmniVL: One Foundation Model for Image-Language and Video-Language Tasks	2022-09-15	-
16	InternVideo	31.5	Yes	-	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022-12-06	Code
17	Clover	29.5	Yes	-	Clover: Towards A Unified Video-Language Alignment and Fusion Model	2022-07-16	Code
18	MILES	27.2	No	SOTA	MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval	2022-04-26	Code
19	Y. Ge et al.	25.6	No	SOTA	Bridging Video-text Retrieval with Multiple Choice Questions	2022-01-13	Code
20	ALPRO	23.8	No	SOTA	Align and Prompt: Video-and-Language Pre-training with Entity Prompts	2021-12-17	Code
21	OA-Trans	23.5	No	-	Object-aware Video-language Pre-training for Retrieval	2021-12-01	Code
22	VIOLET	23.5	No	SOTA	VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling	2021-11-24	Code
23	LaT	22.6	No	-	LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval	2022-07-11	-
24	FROZEN	21.1	Yes	SOTA	Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	2021-04-01	Code
25	M. Bain et al.	20.2	No	-	Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	2021-04-01	Code
26	VideoCLIP	16.6	No	-	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	2021-09-28	Code
