Video on MSR-VTT

Metric: text-to-video Median Rank (lower is better)
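
For readers unfamiliar with the metric, here is a minimal sketch of how a text-to-video median rank is commonly computed. It assumes a square similarity matrix in which query `i` has exactly one ground-truth video at index `i`. The function name `median_rank` and the toy scores are hypothetical and are not taken from any paper listed on this page.

```python
from statistics import median


def median_rank(similarity):
    """Median rank of the ground-truth video over all text queries.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank is 1 plus the number of videos scoring strictly higher
        # than the ground-truth video for this query.
        rank = 1 + sum(1 for score in row if score > target)
        ranks.append(rank)
    return median(ranks)


# Toy example with three queries and three videos (made-up scores).
sim = [
    [0.9, 0.1, 0.3],  # ground truth 0.9, nothing higher -> rank 1
    [0.8, 0.5, 0.6],  # ground truth 0.5, two higher     -> rank 3
    [0.2, 0.7, 0.4],  # ground truth 0.4, one higher     -> rank 2
]
print(median_rank(sim))  # ranks are [1, 3, 2], so the median is 2
```

A rank of 1 means the ground-truth video was retrieved first for that query.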

Results

#	Model	text-to-video Median Rank	Extra Data	Paper	Date	Code
1	C+LSTM+SA+FC7	55	No	Learning Language-Visual Embedding for Movie Understanding with Natural-Language	2016-09-26	-
2	Kaufman	41	No	Temporal Tessellation: A Unified Approach for Video Analysis	2016-12-21	Code
3	JEMC	29.7	No	-	-	Code
4	RoME	17	No	RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval	2022-06-26	Code
5	Collaborative Experts	16	No	Use What You Have: Video Retrieval Using Representations From Collaborative Experts	2019-07-31	Code
6	JSFusion	13	No	A Joint Sequence Fusion Model for Video Question Answering and Retrieval	2018-08-07	Code
7	CLIP	10	No	A Straightforward Framework For Video Retrieval Using CLIP	2021-02-24	Code
8	Text-Video Embedding	9	No	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019-06-07	Code
9	MDMMT	6	Yes	MDMMT: Multidomain Multimodal Transformer for Video Retrieval	2021-03-19	Code
10	UniVL	6	Yes	UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation	2020-02-15	Code
11	TACo	5	Yes	TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment	2021-08-23	-
12	CLIP2Video	4	Yes	CLIP2Video: Mastering Video-Text Retrieval via Image CLIP	2021-06-21	Code
13	UniVL + MELTR	4	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code
14	MDMMT-2	3	Yes	MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization	2022-03-14	-
15	VIOLET + MELTR	3	No	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code
16	CLIP2TV	3	Yes	CLIP2TV: Align, Match and Distill for Video-Text Retrieval	2021-11-10	-
17	CAMoE	3	Yes	Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	2021-09-09	Code
18	COTS	3	No	COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval	2022-04-15	-
19	Ours	3	No	Video and Text Matching with Conditioned Embeddings	2021-10-21	Code