Video Retrieval on MSR-VTT-1kA

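Metric: text-to-video Median Rank (lower is better; 1 is the best possible value). The results below are listed from highest (worst) to lowest (best) median rank.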
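As a rough illustration of how a text-to-video median rank is typically computed, the sketch below ranks the correct video for each text query and takes the median of those ranks. It is a minimal example under stated assumptions, not the evaluation code of any listed paper: the function names, the convention that query i matches video i, the strict-inequality tie handling, and the toy similarity scores are all hypothetical.

```python
from statistics import median


def text_to_video_ranks(similarity):
    """Return the rank of the correct video for each text query.

    Assumption: similarity[i][j] is the score between text query i and
    video j, and the ground-truth video for query i sits at index i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def median_rank(similarity):
    """Median of the per-query ranks; lower is better, 1 is the best value."""
    return median(text_to_video_ranks(similarity))


# Toy 3x3 similarity matrix with made-up scores.
sim = [
    [0.9, 0.1, 0.3],  # correct video (0.9) is ranked 1st
    [0.8, 0.5, 0.6],  # correct video (0.5) is ranked 3rd
    [0.2, 0.7, 0.4],  # correct video (0.4) is ranked 2nd
]

print(text_to_video_ranks(sim))
print(median_rank(sim))
```

Because the median ignores how far the worst queries fall, many strong models tie at small values, which is why the metric is usually reported alongside recall at fixed cutoffs.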

Results

#	Model	text-to-video Median Rank	Extra Data	Paper	Date	Code
1	JSFusion	13	No	A Joint Sequence Fusion Model for Video Question Answering and Retrieval	2018-08-07	Code
2	HT	12	No	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019-06-07	Code
3	HT-Pretrained	9	No	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019-06-07	Code
4	BridgeFormer (Zero-shot)	7	No	Bridging Video-text Retrieval with Multiple Choice Questions	2022-01-13	Code
5	Collaborative Experts	6	Yes	Use What You Have: Video Retrieval Using Representations From Collaborative Experts	2019-07-31	Code
6	CLIP	4	Yes	A Straightforward Framework For Video Retrieval Using CLIP	2021-02-24	Code
7	UniVL + MELTR	4	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code
8	TACo	4	No	TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment	2021-08-23	-
9	VLM	4	Yes	VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding	2021-05-20	Code
10	MMT-Pretrained	4	Yes	Multi-modal Transformer for Video Retrieval	2020-07-21	Code
11	MMT	4	No	Multi-modal Transformer for Video Retrieval	2020-07-21	Code
12	MAC	3	Yes	Masked Contrastive Pre-Training for Efficient Video-Text Retrieval	2022-12-02	-
13	BridgeFormer	3	Yes	Bridging Video-text Retrieval with Multiple Choice Questions	2022-01-13	Code
14	VIOLET + MELTR	3	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code
15	FROZEN	3	Yes	Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	2021-04-01	Code
16	X-CLIP	2	No	X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval	2022-07-15	Code
17	DiffusionRet	2	No	DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023-03-17	Code
18	DiffusionRet+QB-Norm	2	No	DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023-03-17	Code
19	CAMoE	2	Yes	Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	2021-09-09	Code
20	HBI	2	No	Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning	2023-03-25	Code
21	PAU	2	No	Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval	2023-09-29	Code
22	CenterCLIP (ViT-B/16)	2	Yes	CenterCLIP: Token Clustering for Efficient Text-Video Retrieval	2022-05-02	Code
23	QB-Norm+CLIP2Video	2	Yes	Cross Modal Retrieval with Querybank Normalisation	2021-12-23	Code
24	X-Pool	2	Yes	X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval	2022-03-28	Code
25	CLIP2Video	2	Yes	CLIP2Video: Mastering Video-Text Retrieval via Image CLIP	2021-06-21	Code
26	Clover	2	No	Clover: Towards A Unified Video-Language Alignment and Fusion Model	2022-07-16	Code
27	MDMMT	2	Yes	MDMMT: Multidomain Multimodal Transformer for Video Retrieval	2021-03-19	Code
28	COTS	2	Yes	COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval	2022-04-15	-
29	CLIP4Clip	2	Yes	CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	2021-04-18	Code
30	HunYuan_tvr (huge)	1	Yes	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	2022-04-07	-
31	CLIP-ViP	1	Yes	CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment	2022-09-14	Code
32	PIDRo	1	No	-	-	-
33	DMAE (ViT-B/16)	1	No	Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning	2023-09-20	Code
34	STAN	1	Yes	Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring	2023-01-26	Code
35	DRL	1	Yes	Disentangled Representation Learning for Text-Video Retrieval	2022-03-14	Code
36	CLIP2TV	1	Yes	CLIP2TV: Align, Match and Distill for Video-Text Retrieval	2021-11-10	-
37	Side4Video	1	No	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	2023-11-27	Code
38	Cap4Video	1	No	Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?	2022-12-31	Code