Video Retrieval on MSR-VTT-1kA

Metric: text-to-video R@1 (higher is better)
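
For reference, text-to-video R@1 (Recall@1) is the percentage of text queries whose correct video is ranked first among all candidate videos. The sketch below is a minimal illustration, not the evaluation code of any listed paper. It assumes each query has exactly one correct video, that the correct video for query i sits at index i, and that ties are resolved in the query's favor. The function name recall_at_k and the toy scores are hypothetical.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Text-to-video Recall@K, in percent.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        target = row[i]
        # Zero-based rank: number of videos scored strictly higher than
        # the correct one (ties therefore count in the query's favor).
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example with 4 captions and 4 videos (made-up scores).
toy_similarity = [
    [0.9, 0.1, 0.2, 0.0],  # correct video ranked first
    [0.3, 0.8, 0.1, 0.2],  # correct video ranked first
    [0.7, 0.2, 0.5, 0.1],  # correct video ranked second
    [0.1, 0.6, 0.4, 0.3],  # correct video ranked third
]
print(recall_at_k(toy_similarity, k=1))  # 50.0
print(recall_at_k(toy_similarity, k=2))  # 75.0
```

On the toy matrix, two of the four queries retrieve their video at rank one, so R@1 is 50.0. The leaderboard values below are on the same 0 to 100 scale.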

Results

#	Model	text-to-video R@1	Extra Data	Paper	Date	Code
1	HunYuan_tvr (huge)	62.9	Yes	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	2022-04-07	-
2	CLIP-ViP	57.7	Yes	CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment	2022-09-14	Code
3	PIDRo	55.9	No	-	-	-
4	DMAE (ViT-B/16)	55.5	No	Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning	2023-09-20	Code
5	HunYuan_tvr	55	Yes	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	2022-04-07	-
6	MuLTI	54.7	No	MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling	2023-03-10	-
7	STAN	54.1	Yes	Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring	2023-01-26	Code
8	EERCF	54.1	No	Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning	2024-01-01	Code
9	TS2-Net	54	No	TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval	2022-07-16	Code
10	RTQ	53.4	No	RTQ: Rethinking Video-language Understanding Based on Image-text Model	2023-12-01	Code
11	DRL	53.3	Yes	Disentangled Representation Learning for Text-Video Retrieval	2022-03-14	Code
12	mPLUG-2	53.1	No	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023-02-01	Code
13	CLIP2TV	52.9	Yes	CLIP2TV: Align, Match and Distill for Video-Text Retrieval	2021-11-10	-
14	Side4Video	52.3	No	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	2023-11-27	Code
15	EMCL-Net++	51.6	No	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	2022-11-21	Code
16	Cap4Video	51.4	No	Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?	2022-12-31	Code
17	SuMA (ViT-B/16)	49.8	No	Video-Text Retrieval by Supervised Sparse Multi-Grained Learning	2023-02-19	Code
18	X2-VLM (large)	49.6	No	X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	2022-11-22	Code
19	UCoFiA	49.4	No	Unified Coarse-to-Fine Alignment for Video-Text Retrieval	2023-09-18	Code
20	X-CLIP	49.3	No	X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval	2022-07-15	Code
21	DiffusionRet	49	No	DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023-03-17	Code
22	DiffusionRet+QB-Norm	48.9	No	DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023-03-17	Code
23	CAMoE	48.8	Yes	Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	2021-09-09	Code
24	HBI	48.6	No	Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning	2023-03-25	Code
25	PAU	48.5	No	Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval	2023-09-29	Code
26	CenterCLIP (ViT-B/16)	48.4	Yes	CenterCLIP: Token Clustering for Efficient Text-Video Retrieval	2022-05-02	Code
27	TeachCLIP (ViT-B/16)	48	No	-	-	Code
28	X2-VLM (base)	47.6	No	X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	2022-11-22	Code
29	QB-Norm+CLIP2Video	47.2	Yes	Cross Modal Retrieval with Querybank Normalisation	2021-12-23	Code
30	X-Pool	46.9	Yes	X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval	2022-03-28	Code
31	TeachCLIP	46.8	No	-	-	Code
32	EMCL-Net	46.8	No	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	2022-11-21	Code
33	HiTeA	46.8	No	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022-12-30	-
34	VindLU	46.5	Yes	VindLU: A Recipe for Effective Video-and-Language Pretraining	2022-12-09	Code
35	LAFF	45.8	No	Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval	2021-12-03	Code
36	CLIP2Video	45.6	Yes	CLIP2Video: Mastering Video-Text Retrieval via Image CLIP	2021-06-21	Code
37	Singularity	41.5	Yes	Revealing Single Frame Bias for Video-and-Language Learning	2022-06-07	Code
38	All-in-one + MELTR	41.3	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code
39	Clover	40.5	No	Clover: Towards A Unified Video-Language Alignment and Fusion Model	2022-07-16	Code
40	MDMMT	38.9	Yes	MDMMT: Multidomain Multimodal Transformer for Video Retrieval	2021-03-19	Code
41	MAC	38.9	Yes	Masked Contrastive Pre-Training for Efficient Video-Text Retrieval	2022-12-02	-
42	All-in-one-B	37.9	Yes	All in One: Exploring Unified Video-Language Pre-training	2022-03-14	Code
43	BridgeFormer	37.6	Yes	Bridging Video-text Retrieval with Multiple Choice Questions	2022-01-13	Code
44	Florence	37.6	Yes	Florence: A New Foundation Model for Computer Vision	2021-11-22	Code
45	COTS	36.8	Yes	COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval	2022-04-15	-
46	VIOLET + MELTR	35.5	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code
47	CLIP	31.2	Yes	A Straightforward Framework For Video Retrieval Using CLIP	2021-02-24	Code
48	UniVL + MELTR	31.1	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code
49	FROZEN	31	Yes	Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	2021-04-01	Code
50	VideoCLIP	30.9	Yes	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	2021-09-28	Code
51	TACo	28.4	No	TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment	2021-08-23	-
52	VLM	28.1	Yes	VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding	2021-05-20	Code
53	MMT-Pretrained	26.6	Yes	Multi-modal Transformer for Video Retrieval	2020-07-21	Code
54	BridgeFormer (Zero-shot)	26	No	Bridging Video-text Retrieval with Multiple Choice Questions	2022-01-13	Code
55	MMT	24.6	No	Multi-modal Transformer for Video Retrieval	2020-07-21	Code
56	Collaborative Experts	20.9	Yes	Use What You Have: Video Retrieval Using Representations From Collaborative Experts	2019-07-31	Code
57	HT-Pretrained	14.9	No	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019-06-07	Code
58	HT	12.1	No	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019-06-07	Code
59	JSFusion	10.2	No	A Joint Sequence Fusion Model for Video Question Answering and Retrieval	2018-08-07	Code
