Video on VATEX
Metric: text-to-video R@1 (higher is better)
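For readers unfamiliar with the metric, text-to-video R@1 is commonly read as the percentage of text queries whose matching video is ranked first among all candidate videos. The sketch below illustrates that reading under stated assumptions: the function name `recall_at_k`, the similarity scores, and the caption-to-video mapping `gt_video` are all hypothetical, and individual papers on this leaderboard may use different test splits or evaluation protocols.

```python
def recall_at_k(similarity, gt_video, k=1):
    """Percentage of text queries whose ground-truth video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    gt_video[i] is the index of the video that text query i describes.
    Both inputs are illustrative assumptions, not part of the leaderboard.
    """
    hits = 0
    for text_idx, scores in enumerate(similarity):
        # Rank video indices by descending similarity to this text query.
        ranked = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)
        if gt_video[text_idx] in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 4 captions and 3 videos with made-up scores.
similarity = [
    [0.9, 0.1, 0.2],  # caption 0: video 0 ranked first (hit)
    [0.3, 0.8, 0.1],  # caption 1: video 1 ranked first (hit)
    [0.2, 0.7, 0.6],  # caption 2: video 2 ranked second (miss at k=1)
    [0.1, 0.2, 0.5],  # caption 3: video 2 ranked first (hit)
]
gt_video = [0, 1, 2, 2]  # several captions may describe the same video

r1 = recall_at_k(similarity, gt_video, k=1)
r2 = recall_at_k(similarity, gt_video, k=2)
print(f"R@1 = {r1:.1f}, R@2 = {r2:.1f}")
```

Higher values mean more queries retrieve the correct video at the top rank, which is why the leaderboard sorts best-first by this number.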
Results
| # | Model | text-to-video R@1 | Extra data | SOTA badge | Paper | Date |
|---|---|---|---|---|---|---|
| 1 | GRAM | 87.7 | Yes | Yes | Gramian Multimodal Representation Learning and Alignment | 2024-12-16 |
| 2 | VAST | 83 | Yes | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023-05-29 |
| 3 | VALOR | 78.5 | Yes | Yes | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023-04-17 |
| 4 | InternVideo2-6B | 75.5 | Yes | No | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 2024-03-22 |
| 5 | Unmasked Teacher | 72 | No | Yes | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | 2023-03-28 |
| 6 | InternVideo | 71.1 | No | Yes | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | 2022-12-06 |
| 7 | Side4Video | 68.8 | No | No | Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | 2023-11-27 |
| 8 | Cap4Video | 66.6 | No | No | Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | 2022-12-31 |
| 9 | TeachCLIP | 63.6 | No | No | No paper | - |
| 10 | TS2-Net | 59.1 | No | No | TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval | 2022-07-16 |
| 11 | LAFF | 59.1 | No | Yes | Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | 2021-12-03 |
| 12 | QB-Norm+CLIP2Video | 58.8 | Yes | No | Cross Modal Retrieval with Querybank Normalisation | 2021-12-23 |
| 13 | CLIP2Video | 57.3 | Yes | Yes | CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | 2021-06-21 |