Video Retrieval on MSR-VTT-1kA

Metric: text-to-video R@10 (higher is better)
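
For readers unfamiliar with the metric, the following is a minimal sketch of how text-to-video Recall@K is commonly computed. It assumes one ground-truth video per text query, with query i paired to video i. The function name `recall_at_k` and the toy similarity scores are illustrative and do not come from any of the listed papers.

```python
def recall_at_k(similarity, k=10):
    """Percentage of text queries whose paired video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: number of videos scored strictly higher than the
        # true one. Ties are therefore resolved in favour of the true video.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example with hypothetical scores: 3 queries, 3 videos.
toy = [
    [0.9, 0.1, 0.2],  # true video ranked 1st
    [0.8, 0.3, 0.5],  # true video ranked 3rd
    [0.2, 0.7, 0.6],  # true video ranked 2nd
]
r1 = recall_at_k(toy, k=1)  # only the first query is a hit
r2 = recall_at_k(toy, k=2)  # the first and third queries are hits
```

Under this definition, a score of 90.8 would mean that for 90.8% of the text queries the correct video appears among the ten highest-scoring candidates.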

Results

#	Model	text-to-video R@10	Extra Data	Paper	Date	Code	SOTA
1	HunYuan_tvr (huge)	90.8	Yes	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	2022-04-07	-	Yes
2	OmniVec	89.4	Yes	OmniVec: Learning robust representations with cross modal sharing	2023-11-07	-	-
3	CLIP-ViP	88.2	Yes	CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment	2022-09-14	Code	-
4	STAN	87.8	Yes	Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring	2023-01-26	Code	-
5	PIDRo	87.6	No	-	-	-	-
6	DRL	87.6	Yes	Disentangled Representation Learning for Text-Video Retrieval	2022-03-14	Code	Yes
7	TS2-Net	87.4	No	TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval	2022-07-16	Code	-
8	DMAE (ViT-B/16)	87.1	No	Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning	2023-09-20	Code	-
9	EERCF	86.9	No	Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning	2024-01-01	Code	-
10	CLIP2TV	86.5	Yes	CLIP2TV: Align, Match and Distill for Video-Text Retrieval	2021-11-10	-	Yes
11	MuLTI	86.0	No	MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling	2023-03-10	-	-
12	EMCL-Net++	85.3	No	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	2022-11-21	Code	-
13	CAMoE	85.3	Yes	Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	2021-09-09	Code	Yes
14	X-CLIP	84.8	No	X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval	2022-07-15	Code	-
15	mPLUG-2	84.7	No	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023-02-01	Code	-
16	RTQ	84.4	No	RTQ: Rethinking Video-language Understanding Based on Image-text Model	2023-12-01	Code	-
17	Side4Video	84.2	No	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	2023-11-27	Code	-
18	X2-VLM (large)	84.2	No	X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	2022-11-22	Code	-
19	X2-VLM (base)	84.2	No	X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	2022-11-22	Code	-
20	Cap4Video	83.9	No	Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?	2022-12-31	Code	-
21	SuMA (ViT-B/16)	83.9	No	Video-Text Retrieval by Supervised Sparse Multi-Grained Learning	2023-02-19	Code	-
22	UCoFiA	83.5	No	Unified Coarse-to-Fine Alignment for Video-Text Retrieval	2023-09-18	Code	-
23	TeachCLIP (ViT-B/16)	83.5	No	-	-	Code	-
24	HBI	83.4	No	Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning	2023-03-25	Code	-
25	DiffusionRet+QB-Norm	83.1	No	DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023-03-17	Code	-
26	EMCL-Net	83.1	No	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	2022-11-21	Code	-
27	QB-Norm+CLIP2Video	83.0	Yes	Cross Modal Retrieval with Querybank Normalisation	2021-12-23	Code	-
28	DiffusionRet	82.7	No	DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023-03-17	Code	-
29	TeachCLIP	82.6	No	-	-	Code	-
30	PAU	82.5	No	Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval	2023-09-29	Code	-
31	All-in-one + MELTR	82.5	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code	-
32	X-Pool	82.2	Yes	X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval	2022-03-28	Code	-
33	CenterCLIP (ViT-B/16)	82.0	Yes	CenterCLIP: Token Clustering for Efficient Text-Video Retrieval	2022-05-02	Code	-
34	LAFF	82.0	No	Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval	2021-12-03	Code	-
35	HiTeA	81.9	No	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022-12-30	-	-
36	CLIP2Video	81.7	Yes	CLIP2Video: Mastering Video-Text Retrieval via Image CLIP	2021-06-21	Code	Yes
37	CLIP4Clip	81.6	Yes	CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	2021-04-18	Code	Yes
38	VindLU	80.4	Yes	VindLU: A Recipe for Effective Video-and-Language Pretraining	2022-12-09	Code	-
39	MDMMT	79.7	Yes	MDMMT: Multidomain Multimodal Transformer for Video Retrieval	2021-03-19	Code	Yes
40	Clover	79.4	No	Clover: Towards A Unified Video-Language Alignment and Fusion Model	2022-07-16	Code	-
41	OmniVec (pretrained)	78.6	Yes	OmniVec: Learning robust representations with cross modal sharing	2023-11-07	-	-
42	VIOLET + MELTR	78.4	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code	-
43	All-in-one-B	77.1	Yes	All in One: Exploring Unified Video-Language Pre-training	2022-03-14	Code	-
44	Singularity	77.0	Yes	Revealing Single Frame Bias for Video-and-Language Learning	2022-06-07	Code	-
45	BridgeFormer	75.1	Yes	Bridging Video-text Retrieval with Multiple Choice Questions	2022-01-13	Code	-
46	MAC	73.9	Yes	Masked Contrastive Pre-Training for Efficient Video-Text Retrieval	2022-12-02	-	-
47	COTS	73.2	Yes	COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval	2022-04-15	-	-
48	Florence	72.6	Yes	Florence: A New Foundation Model for Computer Vision	2021-11-22	Code	-
49	TACo	71.2	No	TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment	2021-08-23	-	-
50	FROZEN	70.5	Yes	Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	2021-04-01	Code	-
51	MMT-Pretrained	69.6	Yes	Multi-modal Transformer for Video Retrieval	2020-07-21	Code	Yes
52	UniVL + MELTR	68.3	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code	-
53	VLM	67.4	Yes	VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding	2021-05-20	Code	-
54	MMT	67.1	No	Multi-modal Transformer for Video Retrieval	2020-07-21	Code	-
55	VideoCLIP	66.8	Yes	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	2021-09-28	Code	-
56	CLIP	64.2	Yes	A Straightforward Framework For Video Retrieval Using CLIP	2021-02-24	Code	-
57	Collaborative Experts	62.4	Yes	Use What You Have: Video Retrieval Using Representations From Collaborative Experts	2019-07-31	Code	Yes
58	BridgeFormer (Zero-shot)	56.4	No	Bridging Video-text Retrieval with Multiple Choice Questions	2022-01-13	Code	-
59	HT-Pretrained	52.8	No	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019-06-07	Code	Yes
60	HT	48.0	No	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019-06-07	Code	-
61	JSFusion	43.2	No	A Joint Sequence Fusion Model for Video Question Answering and Retrieval	2018-08-07	Code	Yes
