Video Retrieval on MSR-VTT
Metric: text-to-video R@10 (higher is better)
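As a rough illustration of the metric, text-to-video R@10 is commonly computed as the percentage of text queries whose correct video appears among the 10 highest-scoring videos. The sketch below is not taken from any listed paper. It assumes a square similarity matrix in which the correct video for query `i` is video `i`; the function name `recall_at_k` and the toy data are hypothetical.

```python
def recall_at_k(sim, k=10):
    """Percentage of text queries whose ground-truth video ranks in the top k.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    if not sim:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = number of videos scored strictly higher than the correct one.
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy 3x3 example: queries 0 and 1 rank their own video first,
# query 2 ranks its own video last.
toy = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.6, 0.2],
]
r1 = recall_at_k(toy, k=1)  # two of three queries hit at k=1
r3 = recall_at_k(toy, k=3)  # every query hits when k covers all videos
```

Individual papers may differ in test split, tie handling, and post-processing, so scores in the table are not guaranteed to come from an identical protocol.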
Results
| # | Model | text-to-video R@10 | Extra Data | SOTA | Paper | Date | Code |
|---|-------|--------------------|------------|------|-------|------|------|
| 1 | VAST | 89.6 | Yes | | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023-05-29 | Yes |
| 2 | VALOR | 89.6 | Yes | SOTA | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023-04-17 | Yes |
| 3 | GRAM | 89.3 | Yes | | Gramian Multimodal Representation Learning and Alignment | 2024-12-16 | Yes |
| 4 | VLAB | 87.6 | Yes | | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | 2023-05-22 | - |
| 5 | UMT-L (ViT-L/16) | 87.1 | Yes | SOTA | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | 2023-03-28 | Yes |
| 6 | TEFAL | 86.1 | No | | Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment | 2023-07-24 | - |
| 7 | All-in-one + MELTR | 84.7 | Yes | SOTA | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | 2023-03-23 | Yes |
| 8 | OmniVL | 83.8 | Yes | SOTA | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | 2022-09-15 | - |
| 9 | UCoFiA | 83.5 | No | | Unified Coarse-to-Fine Alignment for Video-Text Retrieval | 2023-09-18 | Yes |
| 10 | Aurora (ours, r=64) | 82 | No | | - | - | - |
| 11 | vid-TLDR (UMT-L) | 81.6 | Yes | | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | 2024-03-20 | Yes |
| 12 | CLIP4Clip-seqTransf | 81.6 | No | SOTA | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | 2021-04-18 | Yes |
| 13 | HD-VILA | 78 | No | | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | 2021-11-19 | Yes |
| 14 | VIOLET + MELTR | 77.8 | No | | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | 2023-03-23 | Yes |
| 15 | VIOLETv2 | 75.8 | Yes | | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | 2022-09-04 | Yes |
| 16 | FROZEN | 71.2 | No | SOTA | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | 2021-04-01 | Yes |
| 17 | MDMMT-2 | 70.8 | Yes | | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | 2022-03-14 | - |
| 18 | COTS | 70.2 | No | | COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | 2022-04-15 | - |
| 19 | CLIP2TV | 68.9 | Yes | | CLIP2TV: Align, Match and Distill for Video-Text Retrieval | 2021-11-10 | - |
| 20 | CAMoE | 68.4 | Yes | | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | 2021-09-09 | Yes |
| 21 | UniVL + MELTR | 67.6 | No | | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | 2023-03-23 | Yes |
| 22 | VideoCoCa (zero-shot) | 67 | Yes | | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | 2022-12-09 | - |
| 23 | CLIP2Video | 66.2 | Yes | | CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | 2021-06-21 | Yes |
| 24 | LAFF | 65.8 | No | | Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | 2021-12-03 | Yes |
| 25 | TACo | 64 | Yes | | TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | 2021-08-23 | - |
| 26 | UniVL | 63.1 | Yes | SOTA | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | 2020-02-15 | Yes |
| 27 | MDMMT | 61.8 | Yes | | MDMMT: Multidomain Multimodal Transformer for Video Retrieval | 2021-03-19 | Yes |
| 28 | CoCa (zero-shot) | 61.6 | Yes | | CoCa: Contrastive Captioners are Image-Text Foundation Models | 2022-05-04 | Yes |
| 29 | Text-Video Embedding | 52.8 | No | SOTA | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 2019-06-07 | Yes |
| 30 | CLIP | 50.4 | No | | A Straightforward Framework For Video Retrieval Using CLIP | 2021-02-24 | Yes |
| 31 | JSFusion | 43.2 | No | SOTA | A Joint Sequence Fusion Model for Video Question Answering and Retrieval | 2018-08-07 | Yes |
| 32 | RoME | 41.2 | No | | RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | 2022-06-26 | Yes |
| 33 | Collaborative Experts | 41.2 | No | | Use What You Have: Video Retrieval Using Representations From Collaborative Experts | 2019-07-31 | Yes |
| 34 | JEMC | 29.7 | No | | - | - | Yes |
| 35 | Kaufman | 24.1 | No | SOTA | Temporal Tessellation: A Unified Approach for Video Analysis | 2016-12-21 | Yes |
| 36 | C+LSTM+SA+FC7 | 19.9 | No | SOTA | Learning Language-Visual Embedding for Movie Understanding with Natural-Language | 2016-09-26 | - |