Video Retrieval on YouCook2

Metric: text-to-video R@10 (recall at rank 10, in percent; higher is better)
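As a rough illustration of the metric, text-to-video R@K can be computed from a matrix of text-video similarity scores. The sketch below is not taken from any of the listed papers. It assumes that each text query has exactly one ground-truth video, that query i matches video i, and that ties are resolved in favour of the correct video. The function name `recall_at_k` and the toy scores are hypothetical, and individual papers may use different evaluation protocols.

```python
def recall_at_k(similarity, k=10):
    """Percentage of text queries whose matching video is ranked in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        target = row[i]
        # Zero-based rank = number of videos scoring strictly higher than the
        # correct one. Ties count in the correct video's favour (assumption).
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 text queries against 3 videos (made-up scores).
toy = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked 1st
    [0.8, 0.3, 0.5],  # query 1: correct video ranked 3rd
    [0.2, 0.7, 0.6],  # query 2: correct video ranked 2nd
]
r1 = recall_at_k(toy, k=1)
r2 = recall_at_k(toy, k=2)
r3 = recall_at_k(toy, k=3)
```

On the toy matrix, one of three queries is a hit at K=1, two at K=2, and all three at K=3. A leaderboard value such as 80.8 would then mean that, under this kind of protocol, the correct video appears in the top 10 results for about 80.8% of the text queries.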

Results

#	Model	text-to-video R@10	Extra Data	SOTA	Paper	Date	Code
1	VAST	80.8	Yes	SOTA	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	2023-05-29	Code
2	VideoCLIP	75	Yes	SOTA	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	2021-09-28	Code
3	UniVL + MELTR	74.8	No	-	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023-03-23	Code
4	MDMMT-2	74.8	Yes	-	MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization	2022-03-14	-
5	TACo	72.7	Yes	SOTA	TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment	2021-08-23	-
6	OmniVec	70.8	Yes	-	OmniVec: Learning robust representations with cross modal sharing	2023-11-07	-
7	UniVL	70	Yes	SOTA	UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation	2020-02-15	Code
8	VLM	69.38	Yes	-	VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding	2021-05-20	Code
9	OmniVec (pretrained)	64.2	Yes	-	OmniVec: Learning robust representations with cross modal sharing	2023-11-07	-
10	VideoCLIP (zero-shot)	63.1	Yes	-	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	2021-09-28	Code
11	VideoCoCa (zero-shot)	55.2	No	-	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	2022-12-09	-
12	COOT	52.3	No	-	COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning	2020-11-01	Code
13	Text-Video Embedding	35.3	No	SOTA	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019-06-07	Code
14	RoME	25.2	No	-	RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval	2022-06-26	Code
15	HGLMM FV CCA	21.6	No	-	-	-	-
16	Satar et al.	20.8	No	-	Semantic Role Aware Correlation Transformer for Text to Video Retrieval	2022-06-26	Code