Video on MSR-VTT
Metric: text-to-video R@10 (higher is better)
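For readers unfamiliar with the metric, the sketch below shows one common way text-to-video R@10 (recall at 10) is computed: the percentage of text queries whose correct video appears among the 10 highest-scoring videos. The function name `recall_at_k`, the similarity-matrix layout, and the assumption that query i matches video i are illustrative choices, not taken from this leaderboard or from any listed paper; individual papers may differ in test split and tie handling.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 10) -> float:
    """Percentage of text queries whose matching video ranks in the top k.

    Assumption (illustrative): similarity[i][j] scores text query i against
    video j, and the correct video for query i is video i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        target = row[i]
        # Zero-based rank: number of videos scored strictly higher than the
        # correct one. Ties are resolved in favour of the correct video.
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example with 3 queries and 3 videos (made-up scores).
toy = [
    [0.9, 0.1, 0.2],  # correct video ranked first
    [0.8, 0.3, 0.5],  # correct video ranked third
    [0.1, 0.7, 0.6],  # correct video ranked second
]
print(recall_at_k(toy, k=1))
print(recall_at_k(toy, k=2))
```

On the toy matrix, one of three queries succeeds at k=1 and two of three at k=2, so the score rises as k grows.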
Results
| # | Model | text-to-video R@10 | Extra Data | Paper | Date | Code |
|---|-------|--------------------|------------|-------|------|------|
| 1 | VAST | 89.6 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 2 | VALOR | 89.6 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 3 | GRAM | 89.3 | Yes | Gramian Multimodal Representation Learning and A... | 2024-12-16 | Code |
| 4 | VLAB | 87.6 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 5 | UMT-L (ViT-L/16) | 87.1 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 6 | TEFAL | 86.1 | No | Audio-Enhanced Text-to-Video Retrieval using Tex... | 2023-07-24 | - |
| 7 | All-in-one + MELTR | 84.7 | Yes | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 8 | OmniVL | 83.8 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 9 | UCoFiA | 83.5 | No | Unified Coarse-to-Fine Alignment for Video-Text ... | 2023-09-18 | Code |
| 10 | Aurora (ours, r=64) | 82 | No | - | - | - |
| 11 | vid-TLDR (UMT-L) | 81.6 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 12 | CLIP4Clip-seqTransf | 81.6 | No | CLIP4Clip: An Empirical Study of CLIP for End to... | 2021-04-18 | Code |
| 13 | HD-VILA | 78 | No | Advancing High-Resolution Video-Language Represe... | 2021-11-19 | Code |
| 14 | VIOLET + MELTR | 77.8 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 15 | VIOLETv2 | 75.8 | Yes | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 16 | FROZEN | 71.2 | No | Frozen in Time: A Joint Video and Image Encoder ... | 2021-04-01 | Code |
| 17 | MDMMT-2 | 70.8 | Yes | MDMMT-2: Multidomain Multimodal Transformer for ... | 2022-03-14 | - |
| 18 | COTS | 70.2 | No | COTS: Collaborative Two-Stream Vision-Language P... | 2022-04-15 | - |
| 19 | CLIP2TV | 68.9 | Yes | CLIP2TV: Align, Match and Distill for Video-Text... | 2021-11-10 | - |
| 20 | CAMoE | 68.4 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code |
| 21 | UniVL + MELTR | 67.6 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 22 | VideoCoCa (zero-shot) | 67 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 23 | CLIP2Video | 66.2 | Yes | CLIP2Video: Mastering Video-Text Retrieval via I... | 2021-06-21 | Code |
| 24 | LAFF | 65.8 | No | Lightweight Attentional Feature Fusion: A New Ba... | 2021-12-03 | Code |
| 25 | TACo | 64 | Yes | TACo: Token-aware Cascade Contrastive Learning f... | 2021-08-23 | - |
| 26 | UniVL | 63.1 | Yes | UniVL: A Unified Video and Language Pre-Training... | 2020-02-15 | Code |
| 27 | MDMMT | 61.8 | Yes | MDMMT: Multidomain Multimodal Transformer for Vi... | 2021-03-19 | Code |
| 28 | CoCa (zero-shot) | 61.6 | Yes | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 29 | Text-Video Embedding | 52.8 | No | HowTo100M: Learning a Text-Video Embedding by Wa... | 2019-06-07 | Code |
| 30 | CLIP | 50.4 | No | A Straightforward Framework For Video Retrieval ... | 2021-02-24 | Code |
| 31 | JSFusion | 43.2 | No | A Joint Sequence Fusion Model for Video Question... | 2018-08-07 | Code |
| 32 | RoME | 41.2 | No | RoME: Role-aware Mixture-of-Expert Transformer f... | 2022-06-26 | Code |
| 33 | Collaborative Experts | 41.2 | No | Use What You Have: Video Retrieval Using Represe... | 2019-07-31 | Code |
| 34 | JEMC | 29.7 | No | - | - | Code |
| 35 | Kaufman | 24.1 | No | Temporal Tessellation: A Unified Approach for Vi... | 2016-12-21 | Code |
| 36 | C+LSTM+SA+FC7 | 19.9 | No | Learning Language-Visual Embedding for Movie Und... | 2016-09-26 | - |