Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video on DiDeMo

Metric: text-to-video R@10 (higher is better)
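As a rough illustration of the metric, the sketch below computes Recall@K for text-to-video retrieval from a query-by-video similarity matrix. The page does not state the exact evaluation protocol behind each entry, so the one-to-one pairing of queries and videos, the optimistic tie handling, and the function name `recall_at_k` are all assumptions made for this example.

```python
def recall_at_k(similarity, k=10):
    """Percentage of text queries whose matching video ranks in the top k.

    Assumption: similarity[i][j] scores text query i against video j, and
    the correct video for query i sits at index i (one-to-one pairing).
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score
        # strictly higher. Ties are resolved in the query's favour.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (hypothetical): 3 text queries scored against 3 videos.
sim = [
    [0.9, 0.1, 0.3],  # correct video ranked first
    [0.8, 0.2, 0.5],  # correct video ranked third
    [0.1, 0.2, 0.7],  # correct video ranked first
]

r_at_1 = recall_at_k(sim, k=1)
r_at_3 = recall_at_k(sim, k=3)
```

On the toy matrix, two of the three queries retrieve their video at rank 1, so R@1 is about 66.7, and R@3 is 100 because every correct video falls within the top three. A score such as 94.2 on the leaderboard would, under the same assumptions, mean the correct video appears in the top ten for 94.2% of the test queries.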

Results

| # | Model | text-to-video R@10 | Extra Data | Paper | Date | Code |
|---|-------|--------------------|------------|-------|------|------|
| 1 | vid-TLDR (UMT-L) | 94.2 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 2 | UMT-L (ViT-L/16) | 93.5 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 3 | TESTA (ViT-B/16) | 91.5 | Yes | TESTA: Temporal-Spatial Token Aggregation for Lo... | 2023-10-29 | Code |
| 4 | VAST | 91.4 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 5 | VindLU | 91 | Yes | VindLU: A Recipe for Effective Video-and-Languag... | 2022-12-09 | Code |
| 6 | VALOR | 90.4 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 7 | GRAM | 90.1 | Yes | Gramian Multimodal Representation Learning and A... | 2024-12-16 | Code |
| 8 | RTQ | 89.9 | No | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code |
| 9 | HiTeA | 89.7 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 10 | CLIP-ViP | 89.3 | Yes | CLIP-ViP: Adapting Pre-trained Image-Text Model ... | 2022-09-14 | Code |
| 11 | VLAB | 88.7 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 12 | Cap4Video | 87.5 | No | Cap4Video: What Can Auxiliary Captions Do for Te... | 2022-12-31 | Code |
| 13 | MuLTI | 87 | Yes | MuLTI: Efficient Video-and-Language Understandin... | 2023-03-10 | - |
| 14 | Singularity | 86.9 | Yes | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 15 | DMAE (ViT-B/32) | 86.6 | No | Dual-Modal Attention-Enhanced Text-Video Retriev... | 2023-09-20 | Code |
| 16 | HunYuan_tvr | 85.7 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
| 17 | Clover | 85.6 | No | Clover: Towards A Unified Video-Language Alignme... | 2022-07-16 | Code |
| 18 | OmniVL | 85.4 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 19 | Aurora (ours, r=64) | 85.3 | No | - | - | - |
| 20 | mPLUG-2 | 85.2 | Yes | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 21 | HunYuan_tvr (huge) | 85.2 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
| 22 | STAN | 85.1 | Yes | Revisiting Temporal Modeling for CLIP-based Imag... | 2023-01-26 | Code |
| 23 | DRL | 84.5 | Yes | Disentangled Representation Learning for Text-Vi... | 2022-03-14 | Code |
| 24 | PAU | 84.5 | No | Prototype-based Aleatoric Uncertainty Quantifica... | 2023-09-29 | Code |
| 25 | VIOLETv2 | 84.1 | No | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 26 | DiffusionRet+QB-Norm | 83.3 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 27 | HBI | 82.7 | No | Video-Text as Game Players: Hierarchical Banzhaf... | 2023-03-25 | Code |
| 28 | DiffusionRet | 82.7 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 29 | QB-Norm+CLIP4Clip | 80.9 | Yes | Cross Modal Retrieval with Querybank Normalisation | 2021-12-23 | Code |
| 30 | CLIP4Clip | 80.6 | Yes | CLIP4Clip: An Empirical Study of CLIP for End to... | 2021-04-18 | Code |
| 31 | CAMoE | 79.9 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code |
| 32 | ALPRO | 78.8 | Yes | Align and Prompt: Video-and-Language Pre-trainin... | 2021-12-17 | Code |
| 33 | FROZEN | 72.4 | Yes | Frozen in Time: A Joint Video and Image Encoder ... | 2021-04-01 | Code |
| 34 | HD-VILA | 69.1 | No | Advancing High-Resolution Video-Language Represe... | 2021-11-19 | Code |
| 35 | PO Loss | 56.5 | No | Rudder: A Cross Lingual Video and Text Retrieval... | 2021-03-09 | Code |
| 36 | Collaborative Experts | 54.4 | No | Use What You Have: Video Retrieval Using Represe... | 2019-07-31 | Code |