Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video on DiDeMo

Metric: text-to-video R@1 (higher is better)
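
For readers unfamiliar with the metric, text-to-video R@1 (recall at rank 1) is the percentage of text queries for which the correct video is ranked first among all candidates. The snippet below is a minimal sketch of that computation, not the evaluation code used by any paper listed here. It assumes one ground-truth video per query, with query i matching video i; the function name `recall_at_k` and the toy similarity scores are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Percentage of text queries whose ground-truth video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        target = row[i]
        # Count videos that score strictly higher than the ground-truth video.
        # Ties are resolved in favour of the ground truth (an optimistic choice).
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Hypothetical scores for 4 text queries against 4 videos.
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: video 0 is ranked first (hit)
    [0.2, 0.4, 0.8, 0.1],  # query 1: video 2 outranks video 1 (miss at k=1)
    [0.1, 0.2, 0.7, 0.3],  # query 2: video 2 is ranked first (hit)
    [0.5, 0.3, 0.2, 0.6],  # query 3: video 3 is ranked first (hit)
]
r1 = recall_at_k(toy_similarity, k=1)  # 3 of 4 queries hit -> 75.0
r2 = recall_at_k(toy_similarity, k=2)  # all 4 queries hit -> 100.0
```

Tie handling and candidate-pool construction vary between evaluation protocols, so reported numbers are only comparable when those choices match.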

Results

| # | Model | Text-to-video R@1 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | InternVideo2-6B | 74.2 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 2 | vid-TLDR (UMT-L) | 72.3 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 3 | VAST | 72 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 4 | COSA | 70.5 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 5 | UMT-L (ViT-L/16) | 70.4 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 6 | GRAM | 67.3 | Yes | Gramian Multimodal Representation Learning and A... | 2024-12-16 | Code |
| 7 | VALOR | 61.5 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 8 | TESTA (ViT-B/16) | 61.2 | Yes | TESTA: Temporal-Spatial Token Aggregation for Lo... | 2023-10-29 | Code |
| 9 | VindLU | 61.2 | Yes | VindLU: A Recipe for Effective Video-and-Languag... | 2022-12-09 | Code |
| 10 | InternVideo | 57.9 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 11 | RTQ | 57.6 | No | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code |
| 12 | VLAB | 56.8 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 13 | HiTeA | 56.5 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 14 | MuLTI | 56.5 | Yes | MuLTI: Efficient Video-and-Language Understandin... | 2023-03-10 | - |
| 15 | mPLUG-2 | 56.4 | Yes | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 16 | CLIP-ViP | 55.3 | Yes | CLIP-ViP: Adapting Pre-trained Image-Text Model ... | 2022-09-14 | Code |
| 17 | STAN | 54.6 | Yes | Revisiting Temporal Modeling for CLIP-based Imag... | 2023-01-26 | Code |
| 18 | Singularity | 53.9 | Yes | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 19 | DMAE (ViT-B/32) | 52.7 | No | Dual-Modal Attention-Enhanced Text-Video Retriev... | 2023-09-20 | Code |
| 20 | HunYuan_tvr (huge) | 52.7 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
| 21 | OmniVL | 52.4 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 22 | HunYuan_tvr | 52.1 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
| 23 | Cap4Video | 52 | No | Cap4Video: What Can Auxiliary Captions Do for Te... | 2022-12-31 | Code |
| 24 | Clover | 50.1 | No | Clover: Towards A Unified Video-Language Alignme... | 2022-07-16 | Code |
| 25 | DRL | 49 | Yes | Disentangled Representation Learning for Text-Vi... | 2022-03-14 | Code |
| 26 | DiffusionRet+QB-Norm | 48.9 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 27 | PAU | 48.6 | No | Prototype-based Aleatoric Uncertainty Quantifica... | 2023-09-29 | Code |
| 28 | VIOLETv2 | 47.9 | No | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 29 | X-CLIP | 47.8 | No | X-CLIP: End-to-End Multi-grained Contrastive Lea... | 2022-07-15 | Code |
| 30 | HBI | 46.9 | No | Video-Text as Game Players: Hierarchical Banzhaf... | 2023-03-25 | Code |
| 31 | DiffusionRet | 46.7 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 32 | CAMoE | 43.8 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code |
| 33 | QB-Norm+CLIP4Clip | 43.5 | Yes | Cross Modal Retrieval with Querybank Normalisation | 2021-12-23 | Code |
| 34 | CLIP4Clip | 43.4 | Yes | CLIP4Clip: An Empirical Study of CLIP for End to... | 2021-04-18 | Code |
| 35 | ALPRO | 35.9 | Yes | Align and Prompt: Video-and-Language Pre-trainin... | 2021-12-17 | Code |
| 36 | FROZEN | 31 | Yes | Frozen in Time: A Joint Video and Image Encoder ... | 2021-04-01 | Code |
| 37 | HD-VILA | 28.8 | No | Advancing High-Resolution Video-Language Represe... | 2021-11-19 | Code |
| 38 | PO Loss | 16.3 | No | Rudder: A Cross Lingual Video and Text Retrieval... | 2021-03-09 | Code |
| 39 | Collaborative Experts | 16.1 | No | Use What You Have: Video Retrieval Using Represe... | 2019-07-31 | Code |