Zero-Shot Video Retrieval on ActivityNet
Metric: text-to-video R@10 (higher is better)
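For readers unfamiliar with the metric, text-to-video R@10 is the percentage of text queries whose matching video is ranked among the top 10 retrieved videos. The sketch below is only an illustration, not the evaluation code used by any listed paper. It assumes a square similarity matrix in which query i is paired with video i; the function name `text_to_video_recall_at_k` and the toy scores are hypothetical.

```python
import numpy as np


def text_to_video_recall_at_k(sim: np.ndarray, k: int = 10) -> float:
    """Return text-to-video Recall@K as a percentage.

    Assumption: sim[i, j] is the similarity between text query i and
    video j, and the ground-truth video for query i has index i.
    """
    n = sim.shape[0]
    # Similarity of each query to its own ground-truth video.
    correct = sim[np.arange(n), np.arange(n)]
    # Zero-based rank: how many videos score strictly higher than the correct one.
    ranks = (sim > correct[:, None]).sum(axis=1)
    # A query is a hit when its correct video is within the top k.
    return 100.0 * float((ranks < k).mean())


# Toy example with 3 queries and 3 videos (made-up scores).
toy_sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.6],
])
r_at_1 = text_to_video_recall_at_k(toy_sim, k=1)
r_at_2 = text_to_video_recall_at_k(toy_sim, k=2)
```

In the toy matrix only the first query ranks its own video first, while every query has its video within the top two, so Recall@1 is one third of the queries and Recall@2 covers all of them.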
Results
| # | Model | SOTA | text-to-video R@10 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|---|
| 1 | InternVideo2-6B | SOTA | 92.5 | Yes | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 2024-03-22 | Code |
| 2 | GRAM | | 91.2 | Yes | Gramian Multimodal Representation Learning and Alignment | 2024-12-16 | Code |
| 3 | InternVideo2-1B | | 90.8 | Yes | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 2024-03-22 | Code |
| 4 | LanguageBind(ViT-H/14) | SOTA | 80 | Yes | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | 2023-10-03 | Code |
| 5 | UMT-L (ViT-L/16) | SOTA | 79.8 | Yes | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | 2023-03-28 | Code |
| 6 | vid-TLDR (UMT-L) | | 79.6 | Yes | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | 2024-03-20 | Code |
| 7 | BT-Adapter | | 78.9 | Yes | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | Code |
| 8 | LanguageBind(ViT-L/14) | | 77.9 | Yes | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | 2023-10-03 | Code |
| 9 | VideoCoCa | SOTA | 76.6 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | 2022-12-09 | - |
| 10 | Singularity-temporal-17M | SOTA | 66.9 | Yes | Revealing Single Frame Bias for Video-and-Language Learning | 2022-06-07 | Code |
| 11 | Singularity-temporal-5M | | 66.3 | Yes | Revealing Single Frame Bias for Video-and-Language Learning | 2022-06-07 | Code |