Video on MSR-VTT
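Metric: text-to-video Median Rank (lower is better). The results below are listed from highest to lowest median rank, so the strongest entries appear last.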
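For readers unfamiliar with the metric, the following is a minimal sketch of how a text-to-video median rank is commonly computed from a matrix of text-video similarity scores. The function name `median_rank`, the convention that query `i` matches video `i`, and the toy scores are illustrative assumptions; none of them come from the listed papers, whose evaluation code may differ (for example in tie handling). A rank of 1 means the correct video was retrieved first.

```python
from statistics import median


def median_rank(sim):
    """Median rank of the correct video over all text queries.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        correct_score = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        rank = 1 + sum(1 for score in row if score > correct_score)
        ranks.append(rank)
    return median(ranks)


# Toy example: 3 text queries against 3 videos (made-up scores).
sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.5, 0.2],  # correct video ranked 2nd
    [0.4, 0.7, 0.6],  # correct video ranked 2nd
]
print(median_rank(sim))  # prints 2
```

Because the statistic is a median over per-query ranks, it is insensitive to a few badly ranked queries, which is one reason leaderboards often report it alongside recall-based metrics.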
Results
| # | Model | text-to-video Median Rank | Extra Data | Paper | Date | Code |
|---|-------|---------------------------|------------|-------|------|------|
| 1 | C+LSTM+SA+FC7 | 55 | No | Learning Language-Visual Embedding for Movie Understanding with Natural-Language | 2016-09-26 | No |
| 2 | Kaufman | 41 | No | Temporal Tessellation: A Unified Approach for Video Analysis | 2016-12-21 | Yes |
| 3 | JEMC | 29.7 | No | - | - | Yes |
| 4 | RoME | 17 | No | RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | 2022-06-26 | Yes |
| 5 | Collaborative Experts | 16 | No | Use What You Have: Video Retrieval Using Representations From Collaborative Experts | 2019-07-31 | Yes |
| 6 | JSFusion | 13 | No | A Joint Sequence Fusion Model for Video Question Answering and Retrieval | 2018-08-07 | Yes |
| 7 | CLIP | 10 | No | A Straightforward Framework For Video Retrieval Using CLIP | 2021-02-24 | Yes |
| 8 | Text-Video Embedding | 9 | No | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 2019-06-07 | Yes |
| 9 | MDMMT | 6 | Yes | MDMMT: Multidomain Multimodal Transformer for Video Retrieval | 2021-03-19 | Yes |
| 10 | UniVL | 6 | Yes | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | 2020-02-15 | Yes |
| 11 | TACo | 5 | Yes | TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | 2021-08-23 | No |
| 12 | CLIP2Video | 4 | Yes | CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | 2021-06-21 | Yes |
| 13 | UniVL + MELTR | 4 | No | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | 2023-03-23 | Yes |
| 14 | MDMMT-2 | 3 | Yes | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | 2022-03-14 | No |
| 15 | VIOLET + MELTR | 3 | No | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | 2023-03-23 | Yes |
| 16 | CLIP2TV | 3 | Yes | CLIP2TV: Align, Match and Distill for Video-Text Retrieval | 2021-11-10 | No |
| 17 | CAMoE | 3 | Yes | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | 2021-09-09 | Yes |
| 18 | COTS | 3 | No | COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | 2022-04-15 | No |
| 19 | Ours | 3 | No | Video and Text Matching with Conditioned Embeddings | 2021-10-21 | Yes |