Video Retrieval on VATEX

Metric: text-to-video R@1 (higher is better)
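
For readers unfamiliar with the metric, here is a minimal sketch of how text-to-video Recall@K is commonly computed from a text-by-video similarity matrix. It is not taken from any of the listed papers. The function name `recall_at_k`, the toy scores, and the assumption that each text query has exactly one ground-truth video are illustrative only; individual papers may differ in how they handle multiple captions per video or tied scores.

```python
from typing import Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    gt_video: Sequence[int],
    k: int = 1,
) -> float:
    """Percentage of text queries whose ground-truth video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    gt_video[i] is the index of the correct video for query i (assumed unique).
    """
    hits = 0
    for scores, target in zip(similarity, gt_video):
        # Rank of the target = number of videos scored strictly higher.
        # Ties are resolved in favor of the target (an optimistic assumption).
        rank = sum(1 for s in scores if s > scores[target])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 4 text queries scored against 3 videos (made-up numbers).
sim = [
    [0.9, 0.1, 0.3],  # query 0, correct video 0: ranked first
    [0.2, 0.4, 0.8],  # query 1, correct video 1: ranked second
    [0.5, 0.6, 0.7],  # query 2, correct video 2: ranked first
    [0.1, 0.7, 0.2],  # query 3, correct video 1: ranked first
]
gt = [0, 1, 2, 1]

r1 = recall_at_k(sim, gt, k=1)
r2 = recall_at_k(sim, gt, k=2)
```

In this toy case three of the four queries retrieve their video at rank one, so `r1` is 75.0, and all four succeed within the top two, so `r2` is 100.0. The leaderboard values are reported on the same 0 to 100 scale.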

Results

#	Model	text-to-video R@1	Extra Data	SOTA	Paper	Date	Code
1	GRAM	87.7	Yes	SOTA	Gramian Multimodal Representation Learning and Alignment	2024-12-16	Code
2	VAST	83	Yes	SOTA	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	2023-05-29	Code
3	VALOR	78.5	Yes	SOTA	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	2023-04-17	Code
4	InternVideo2-6B	75.5	Yes	-	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024-03-22	Code
5	Unmasked Teacher	72	No	SOTA	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	2023-03-28	Code
6	InternVideo	71.1	No	SOTA	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022-12-06	Code
7	Side4Video	68.8	No	-	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	2023-11-27	Code
8	Cap4Video	66.6	No	-	Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?	2022-12-31	Code
9	TeachCLIP	63.6	No	-	-	-	Code
10	TS2-Net	59.1	No	-	TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval	2022-07-16	Code
11	LAFF	59.1	No	SOTA	Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval	2021-12-03	Code
12	QB-Norm+CLIP2Video	58.8	Yes	-	Cross Modal Retrieval with Querybank Normalisation	2021-12-23	Code
13	CLIP2Video	57.3	Yes	SOTA	CLIP2Video: Mastering Video-Text Retrieval via Image CLIP	2021-06-21	Code