Video on MSR-VTT-1kA

Metric: text-to-video R@5 (higher is better)
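
For readers unfamiliar with the metric, the following is a minimal sketch of how text-to-video Recall@K is commonly computed. It is not the evaluation code of any listed model. The function name `recall_at_k` and the toy similarity matrix are hypothetical, and the sketch assumes each text query has exactly one ground-truth video, stored at the same index.

```python
def recall_at_k(similarity, k=5):
    """Percentage of text queries whose ground-truth video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example (made-up scores): 3 text queries against 3 videos.
sim = [
    [0.9, 0.1, 0.2],  # correct video is ranked 1st
    [0.8, 0.3, 0.5],  # correct video is ranked 3rd
    [0.1, 0.7, 0.6],  # correct video is ranked 2nd
]
print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

Ties are resolved in favor of the correct video here, which is an optimistic choice; a stricter evaluation might count tied scores against it.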

Results

#	Model	text-to-video R@5	Extra Data	Paper	Date	Code
1	HunYuan_tvr (huge)	84.5	Yes	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	2022-04-07	-
2	CLIP-ViP	80.5	Yes	CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment	2022-09-14	Code
3	DRL	80.3	Yes	Disentangled Representation Learning for Text-Video Retrieval	2022-03-14	Code
4	PIDRo	79.8	No	-	-	-
5	STAN	79.5	Yes	Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring	2023-01-26	Code
6	DMAE (ViT-B/16)	79.4	No	Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning	2023-09-20	Code
7	TS2-Net	79.3	No	TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval	2022-07-16	Code
8	EERCF	78.8	No	Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning	2024-01-01	Code
9	CLIP2TV	78.5	Yes	CLIP2TV: Align, Match and Distill for Video-Text Retrieval	2021-11-10	-
10	EMCL-Net++	78.1	No	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	2022-11-21	Code
11	MuLTI	77.7	No	MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling	2023-03-10	-
12	mPLUG-2	77.6	No	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023-02-01	Code
13	X2-VLM (large)	76.7	No	X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	2022-11-22	Code
14	RTQ	76.1	No	RTQ: Rethinking Video-language Understanding Based on Image-text Model	2023-12-01	Code
15	TeachCLIP (ViT-B/16)	75.9	No	-	-	Code
16	X-CLIP	75.8	No	X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval	2022-07-15	Code
17	Cap4Video	75.7	No	Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?	2022-12-31	Code
18	CAMoE	75.6	Yes	Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	2021-09-09	Code
19	Side4Video	75.5	No	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	2023-11-27	Code
20	DiffusionRet	75.2	No	DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023-03-17	Code
21	DiffusionRet+QB-Norm	75.2	No	DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023-03-17	Code
22	SuMA (ViT-B/16)	75.1	No	Video-Text Retrieval by Supervised Sparse Multi-Grained Learning	2023-02-19	Code
23	HBI	74.6	No	Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning	2023-03-25	Code
24	TeachCLIP	74.3	No	-	-	Code
25	X2-VLM (base)	74.1	No	X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	2022-11-22	Code
26	CenterCLIP (ViT-B/16)	73.8	Yes	CenterCLIP: Token Clustering for Efficient Text-Video Retrieval	2022-05-02	Code
27	All-in-one + MELTR	73.5	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code
28	EMCL-Net	73.1	No	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	2022-11-21	Code
29	QB-Norm+CLIP2Video	73	Yes	Cross Modal Retrieval with Querybank Normalisation	2021-12-23	Code
30	X-Pool	72.8	Yes	X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval	2022-03-28	Code
31	PAU	72.7	No	Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval	2023-09-29	Code
32	CLIP2Video	72.6	Yes	CLIP2Video: Mastering Video-Text Retrieval via Image CLIP	2021-06-21	Code
33	UCoFiA	72.1	No	Unified Coarse-to-Fine Alignment for Video-Text Retrieval	2023-09-18	Code
34	VindLU	71.5	Yes	VindLU: A Recipe for Effective Video-and-Language Pretraining	2022-12-09	Code
35	LAFF	71.5	No	Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval	2021-12-03	Code
36	HiTeA	71.2	No	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022-12-30	-
37	Clover	69.8	No	Clover: Towards A Unified Video-Language Alignment and Fusion Model	2022-07-16	Code
38	MDMMT	69	Yes	MDMMT: Multidomain Multimodal Transformer for Video Retrieval	2021-03-19	Code
39	Singularity	68.7	Yes	Revealing Single Frame Bias for Video-and-Language Learning	2022-06-07	Code
40	All-in-one-B	68.1	Yes	All in One: Exploring Unified Video-Language Pre-training	2022-03-14	Code
41	VIOLET + MELTR	67.2	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code
42	BridgeFormer	64.8	Yes	Bridging Video-text Retrieval with Multiple Choice Questions	2022-01-13	Code
43	Florence	63.8	Yes	Florence: A New Foundation Model for Computer Vision	2021-11-22	Code
44	COTS	63.8	Yes	COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval	2022-04-15	-
45	MAC	63.1	Yes	Masked Contrastive Pre-Training for Efficient Video-Text Retrieval	2022-12-02	-
46	FROZEN	59.5	Yes	Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	2021-04-01	Code
47	TACo	57.8	No	TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment	2021-08-23	-
48	MMT-Pretrained	57.1	Yes	Multi-modal Transformer for Video Retrieval	2020-07-21	Code
49	UniVL + MELTR	55.7	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code
50	VLM	55.5	Yes	VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding	2021-05-20	Code
51	VideoCLIP	55.4	Yes	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	2021-09-28	Code
52	MMT	54	No	Multi-modal Transformer for Video Retrieval	2020-07-21	Code
53	CLIP	53.7	Yes	A Straightforward Framework For Video Retrieval Using CLIP	2021-02-24	Code
54	Collaborative Experts	48.8	Yes	Use What You Have: Video Retrieval Using Representations From Collaborative Experts	2019-07-31	Code
55	BridgeFormer (Zero-shot)	46.4	No	Bridging Video-text Retrieval with Multiple Choice Questions	2022-01-13	Code
56	HT-Pretrained	40.2	No	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019-06-07	Code
57	HT	35	No	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019-06-07	Code
58	JSFusion	31.2	No	A Joint Sequence Fusion Model for Video Question Answering and Retrieval	2018-08-07	Code
