Video Retrieval on MSR-VTT
Metric: text-to-video R@1 (higher is better)
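R@1 (Recall at 1) is commonly reported as the percentage of text queries for which the ground-truth video is ranked first among all candidates. The sketch below shows one way to compute it. It assumes a square similarity matrix in which row i holds the scores of text query i against every video and video i is the single correct match; the function name `recall_at_k` and the toy scores are illustrative and not taken from any listed paper.

```python
def recall_at_k(similarity, k=1):
    """Percentage of text queries whose correct video ranks in the top k.

    Assumption of this sketch: similarity[i][j] scores text query i against
    video j, and video i is the only correct match for query i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly
        # higher. Ties are resolved in favour of the correct video.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data: queries 0 and 2 rank their own video first, query 1 does not.
toy = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.4, 0.7],
]

r_at_1 = recall_at_k(toy, k=1)
r_at_2 = recall_at_k(toy, k=2)
```

Evaluation protocols differ between papers (test split, number of candidate videos, tie handling), so scores computed this way are only comparable when the setup matches.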
Results
| # | Model | text-to-video R@1 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | GRAM | 64 | Yes | Gramian Multimodal Representation Learning and A... | 2024-12-16 | Code |
| 2 | VAST | 63.9 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 3 | InternVideo2-6B | 62.8 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 4 | VALOR | 59.9 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 5 | UMT-L (ViT-L/16) | 58.8 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 6 | vid-TLDR (UMT-L) | 58.1 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 7 | COSA | 57.9 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 8 | InternVideo | 55.2 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 9 | VLAB | 55.1 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 10 | Aurora (ours, r=64) | 52.4 | No | - | - | - |
| 11 | TEFAL | 52 | No | Audio-Enhanced Text-to-Video Retrieval using Tex... | 2023-07-24 | - |
| 12 | UCoFiA | 49.4 | No | Unified Coarse-to-Fine Alignment for Video-Text ... | 2023-09-18 | Code |
| 13 | OmniVL | 47.8 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 14 | CLIP4Clip-seqTransf | 44.5 | No | CLIP4Clip: An Empirical Study of CLIP for End to... | 2021-04-18 | Code |
| 15 | All-in-one + MELTR | 38.6 | Yes | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 16 | VIOLETv2 | 37.2 | Yes | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 17 | HD-VILA | 35.6 | No | Advancing High-Resolution Video-Language Represe... | 2021-11-19 | Code |
| 18 | VideoCoCa (zero-shot) | 34.3 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 19 | MDMMT-2 | 33.7 | Yes | MDMMT-2: Multidomain Multimodal Transformer for ... | 2022-03-14 | - |
| 20 | VIOLET + MELTR | 33.6 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 21 | CLIP2TV | 33.1 | Yes | CLIP2TV: Align, Match and Distill for Video-Text... | 2021-11-10 | - |
| 22 | CAMoE | 32.9 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code |
| 23 | FROZEN | 32.5 | No | Frozen in Time: A Joint Video and Image Encoder ... | 2021-04-01 | Code |
| 24 | COTS | 32.1 | No | COTS: Collaborative Two-Stream Vision-Language P... | 2022-04-15 | - |
| 25 | CoCa (zero-shot) | 30 | Yes | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 26 | CLIP2Video | 29.8 | Yes | CLIP2Video: Mastering Video-Text Retrieval via I... | 2021-06-21 | Code |
| 27 | LAFF | 29.1 | No | Lightweight Attentional Feature Fusion: A New Ba... | 2021-12-03 | Code |
| 28 | UniVL + MELTR | 28.5 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 29 | Ours | 26 | No | Video and Text Matching with Conditioned Embeddi... | 2021-10-21 | Code |
| 30 | TACo | 24.8 | Yes | TACo: Token-aware Cascade Contrastive Learning f... | 2021-08-23 | - |
| 31 | MDMMT | 23.1 | Yes | MDMMT: Multidomain Multimodal Transformer for Vi... | 2021-03-19 | Code |
| 32 | CLIP | 21.4 | No | A Straightforward Framework For Video Retrieval ... | 2021-02-24 | Code |
| 33 | UniVL | 21.2 | Yes | UniVL: A Unified Video and Language Pre-Training... | 2020-02-15 | Code |
| 34 | Text-Video Embedding | 14.9 | No | HowTo100M: Learning a Text-Video Embedding by Wa... | 2019-06-07 | Code |
| 35 | RoME | 10.7 | No | RoME: Role-aware Mixture-of-Expert Transformer f... | 2022-06-26 | Code |
| 36 | JSFusion | 10.2 | No | A Joint Sequence Fusion Model for Video Question... | 2018-08-07 | Code |
| 37 | Collaborative Experts | 10 | No | Use What You Have: Video Retrieval Using Represe... | 2019-07-31 | Code |
| 38 | JEMC | 7 | No | - | - | Code |
| 39 | Kaufman | 4.7 | No | Temporal Tessellation: A Unified Approach for Vi... | 2016-12-21 | Code |
| 40 | C+LSTM+SA+FC7 | 4.2 | No | Learning Language-Visual Embedding for Movie Und... | 2016-09-26 | - |