Video on YouCook2
Metric: text-to-video R@10 (higher is better)
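For readers unfamiliar with the metric, the following is a minimal sketch of how text-to-video Recall@K is commonly computed, not the evaluation code of any listed paper. It assumes a square similarity matrix in which row `i` holds the scores of text query `i` against every video, and that the correct video for query `i` is at index `i`. The function name `recall_at_k` and that data layout are illustrative assumptions.

```python
from typing import Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int = 10) -> float:
    """Percentage of text queries whose correct video ranks in the top k.

    Assumption (hypothetical layout): sim[i][j] is the similarity between
    text query i and video j, and the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        correct_score = row[i]
        # 0-based rank = number of videos scoring strictly higher
        # than the correct one.
        rank = sum(1 for score in row if score > correct_score)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy example: 20 queries. The first 10 retrieve their video at rank 0;
# the last 10 place the correct video below all 19 others.
toy = [[1.0 if i == j else 0.0 for j in range(20)] for i in range(20)]
for i in range(10, 20):
    toy[i][i] = -1.0

print(recall_at_k(toy, k=10))
```

A score of 80.8 on this page would then mean that, for 80.8% of the text queries, the correct video appears among the ten highest-scoring candidates.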
Results
| # | Model | text-to-video R@10 | Extra Data | SOTA | Paper | Date | Code |
|---|-------|--------------------|------------|------|-------|------|------|
| 1 | VAST | 80.8 | Yes | SOTA | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023-05-29 | Yes |
| 2 | VideoCLIP | 75 | Yes | SOTA | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | 2021-09-28 | Yes |
| 3 | UniVL + MELTR | 74.8 | No | - | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | 2023-03-23 | Yes |
| 4 | MDMMT-2 | 74.8 | Yes | - | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | 2022-03-14 | No |
| 5 | TACo | 72.7 | Yes | SOTA | TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | 2021-08-23 | No |
| 6 | OmniVec | 70.8 | Yes | - | OmniVec: Learning robust representations with cross modal sharing | 2023-11-07 | No |
| 7 | UniVL | 70 | Yes | SOTA | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | 2020-02-15 | Yes |
| 8 | VLM | 69.38 | Yes | - | VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | 2021-05-20 | Yes |
| 9 | OmniVec (pretrained) | 64.2 | Yes | - | OmniVec: Learning robust representations with cross modal sharing | 2023-11-07 | No |
| 10 | VideoCLIP (zero-shot) | 63.1 | Yes | - | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | 2021-09-28 | Yes |
| 11 | VideoCoCa (zero-shot) | 55.2 | No | - | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | 2022-12-09 | No |
| 12 | COOT | 52.3 | No | - | COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | 2020-11-01 | Yes |
| 13 | Text-Video Embedding | 35.3 | No | SOTA | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 2019-06-07 | Yes |
| 14 | RoME | 25.2 | No | - | RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | 2022-06-26 | Yes |
| 15 | HGLMM FV CCA | 21.6 | No | - | No paper | - | No |
| 16 | Satar et al. | 20.8 | No | - | Semantic Role Aware Correlation Transformer for Text to Video Retrieval | 2022-06-26 | Yes |
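The page does not define the SOTA badge attached to some entries. One plausible reading, stated here as an assumption, is that it marks results that set a new best score as of their publication date. The sketch below tests that reading against the scores and dates listed above; `ENTRIES` and `sota_progression` are illustrative names, and HGLMM FV CCA is omitted because it has no date.

```python
# (model, text-to-video R@10, publication date) copied from the results above.
ENTRIES = [
    ("VAST", 80.8, "2023-05-29"),
    ("VideoCLIP", 75.0, "2021-09-28"),
    ("UniVL + MELTR", 74.8, "2023-03-23"),
    ("MDMMT-2", 74.8, "2022-03-14"),
    ("TACo", 72.7, "2021-08-23"),
    ("OmniVec", 70.8, "2023-11-07"),
    ("UniVL", 70.0, "2020-02-15"),
    ("VLM", 69.38, "2021-05-20"),
    ("OmniVec (pretrained)", 64.2, "2023-11-07"),
    ("VideoCLIP (zero-shot)", 63.1, "2021-09-28"),
    ("VideoCoCa (zero-shot)", 55.2, "2022-12-09"),
    ("COOT", 52.3, "2020-11-01"),
    ("Text-Video Embedding", 35.3, "2019-06-07"),
    ("RoME", 25.2, "2022-06-26"),
    ("Satar et al.", 20.8, "2022-06-26"),
]


def sota_progression(entries):
    """Return models that beat every earlier-dated score (assumed badge rule)."""
    # ISO dates sort correctly as strings; on equal dates, try the
    # higher score first so only the best same-day result can qualify.
    ordered = sorted(entries, key=lambda e: (e[2], -e[1]))
    best = float("-inf")
    progression = []
    for model, score, _date in ordered:
        if score > best:
            best = score
            progression.append(model)
    return progression


print(sota_progression(ENTRIES))
```

Under this assumption the function returns Text-Video Embedding, UniVL, TACo, VideoCLIP, and VAST, which matches the five entries carrying the badge on this page.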