Video on MSR-VTT
Metric: text-to-video R@1 (higher is better)
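For readers unfamiliar with the metric: text-to-video R@1 (Recall@1) is commonly computed as the percentage of text queries for which the ground-truth video is ranked first by the model's similarity scores. The following is a minimal sketch under stated assumptions, not the evaluation code of any listed paper. It assumes a square similarity matrix in which row i is a text query and column i is its matching video. The function name `recall_at_k` and the toy scores are illustrative only, and individual papers may differ in test split and tie handling.

```python
def recall_at_k(sim, k=1):
    """Percentage of text queries whose matching video ranks in the top k.

    sim[i][j] is the similarity of text query i to video j. The ground-truth
    video for query i is assumed to sit at index i (an assumption of this sketch).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy scores for 4 queries and 4 videos (hypothetical values).
toy_sim = [
    [0.9, 0.1, 0.2, 0.3],  # correct video ranked 1st
    [0.4, 0.8, 0.1, 0.2],  # correct video ranked 1st
    [0.7, 0.2, 0.5, 0.1],  # correct video ranked 2nd
    [0.6, 0.5, 0.4, 0.3],  # correct video ranked 4th
]
print(recall_at_k(toy_sim, k=1))  # 50.0
print(recall_at_k(toy_sim, k=2))  # 75.0
```

Under this reading, a leaderboard value of 64 would mean that the correct video was the top-ranked result for 64% of the test queries.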
Results
| # | Model | text-to-video R@1 | SOTA | Extra Data | Date | Paper | Code |
|---|---|---|---|---|---|---|---|
| 1 | GRAM | 64 | SOTA | Yes | 2024-12-16 | Gramian Multimodal Representation Learning and Alignment | Code |
| 2 | VAST | 63.9 | SOTA | Yes | 2023-05-29 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Code |
| 3 | InternVideo2-6B | 62.8 | - | Yes | 2024-03-22 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Code |
| 4 | VALOR | 59.9 | SOTA | Yes | 2023-04-17 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Code |
| 5 | UMT-L (ViT-L/16) | 58.8 | SOTA | Yes | 2023-03-28 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Code |
| 6 | vid-TLDR (UMT-L) | 58.1 | - | Yes | 2024-03-20 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Code |
| 7 | COSA | 57.9 | - | Yes | 2023-06-15 | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Code |
| 8 | InternVideo | 55.2 | SOTA | Yes | 2022-12-06 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Code |
| 9 | VLAB | 55.1 | - | Yes | 2023-05-22 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | - |
| 10 | Aurora (ours, r=64) | 52.4 | - | No | - | No paper | - |
| 11 | TEFAL | 52 | - | No | 2023-07-24 | Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment | - |
| 12 | UCoFiA | 49.4 | - | No | 2023-09-18 | Unified Coarse-to-Fine Alignment for Video-Text Retrieval | Code |
| 13 | OmniVL | 47.8 | SOTA | Yes | 2022-09-15 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | - |
| 14 | CLIP4Clip-seqTransf | 44.5 | SOTA | No | 2021-04-18 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Code |
| 15 | All-in-one + MELTR | 38.6 | - | Yes | 2023-03-23 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Code |
| 16 | VIOLETv2 | 37.2 | - | Yes | 2022-09-04 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Code |
| 17 | HD-VILA | 35.6 | - | No | 2021-11-19 | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | Code |
| 18 | VideoCoCa (zero-shot) | 34.3 | - | Yes | 2022-12-09 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |
| 19 | MDMMT-2 | 33.7 | - | Yes | 2022-03-14 | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | - |
| 20 | VIOLET + MELTR | 33.6 | - | No | 2023-03-23 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Code |
| 21 | CLIP2TV | 33.1 | - | Yes | 2021-11-10 | CLIP2TV: Align, Match and Distill for Video-Text Retrieval | - |
| 22 | CAMoE | 32.9 | - | Yes | 2021-09-09 | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | Code |
| 23 | FROZEN | 32.5 | SOTA | No | 2021-04-01 | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Code |
| 24 | COTS | 32.1 | - | No | 2022-04-15 | COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | - |
| 25 | CoCa (zero-shot) | 30 | - | Yes | 2022-05-04 | CoCa: Contrastive Captioners are Image-Text Foundation Models | Code |
| 26 | CLIP2Video | 29.8 | - | Yes | 2021-06-21 | CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | Code |
| 27 | LAFF | 29.1 | - | No | 2021-12-03 | Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | Code |
| 28 | UniVL + MELTR | 28.5 | - | No | 2023-03-23 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Code |
| 29 | Ours | 26 | - | No | 2021-10-21 | Video and Text Matching with Conditioned Embeddings | Code |
| 30 | TACo | 24.8 | - | Yes | 2021-08-23 | TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | - |
| 31 | MDMMT | 23.1 | SOTA | Yes | 2021-03-19 | MDMMT: Multidomain Multimodal Transformer for Video Retrieval | Code |
| 32 | CLIP | 21.4 | SOTA | No | 2021-02-24 | A Straightforward Framework For Video Retrieval Using CLIP | Code |
| 33 | UniVL | 21.2 | SOTA | Yes | 2020-02-15 | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | Code |
| 34 | Text-Video Embedding | 14.9 | SOTA | No | 2019-06-07 | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | Code |
| 35 | RoME | 10.7 | - | No | 2022-06-26 | RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | Code |
| 36 | JSFusion | 10.2 | SOTA | No | 2018-08-07 | A Joint Sequence Fusion Model for Video Question Answering and Retrieval | Code |
| 37 | Collaborative Experts | 10 | - | No | 2019-07-31 | Use What You Have: Video Retrieval Using Representations From Collaborative Experts | Code |
| 38 | JEMC | 7 | - | No | - | No paper | Code |
| 39 | Kaufman | 4.7 | SOTA | No | 2016-12-21 | Temporal Tessellation: A Unified Approach for Video Analysis | Code |
| 40 | C+LSTM+SA+FC7 | 4.2 | SOTA | No | 2016-09-26 | Learning Language-Visual Embedding for Movie Understanding with Natural-Language | - |