Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video on DiDeMo

Metric: text-to-video R@5 (higher is better)
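
For readers unfamiliar with the metric, the following is a minimal sketch of how text-to-video Recall@K is commonly computed. It assumes a square similarity matrix in which text query i is paired with exactly one ground-truth video at index i; that pairing, the function name `recall_at_k`, and the tie-handling rule are illustrative assumptions, not the evaluation protocol of any specific paper listed here.

```python
import numpy as np


def recall_at_k(sim, k=5):
    """Percentage of text queries whose ground-truth video is in the top k.

    Assumption: sim[i, j] is the similarity between text query i and
    video j, and the correct video for query i is video i (the diagonal).
    """
    sim = np.asarray(sim, dtype=float)
    # Similarity of each query to its own ground-truth video, as a column.
    correct = np.diag(sim)[:, None]
    # Zero-based rank: number of videos scoring strictly higher than the
    # correct one. Ties are resolved in favor of the correct video.
    ranks = (sim > correct).sum(axis=1)
    return 100.0 * float(np.mean(ranks < k))
```

Calling `recall_at_k(sim, k=5)` on a queries-by-videos similarity matrix returns a value between 0 and 100, matching the scale used in the results below. Strict comparison makes ties optimistic; a stricter evaluation could count ties against the correct video instead.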

Results

# | Model | Text-to-video R@5 | Extra Data | Paper | Date | Code
1 | vid-TLDR (UMT-L) | 91.2 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code
2 | UMT-L (ViT-L/16) | 90.1 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code
3 | VAST | 89 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code
4 | TESTA (ViT-B/16) | 87.2 | Yes | TESTA: Temporal-Spatial Token Aggregation for Lo... | 2023-10-29 | Code
5 | VindLU | 85.8 | Yes | VindLU: A Recipe for Effective Video-and-Languag... | 2022-12-09 | Code
6 | VALOR | 85.3 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code
7 | RTQ | 84.1 | No | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code
8 | CLIP-ViP | 82 | Yes | CLIP-ViP: Adapting Pre-trained Image-Text Model ... | 2022-09-14 | Code
9 | HiTeA | 81.7 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | -
10 | VLAB | 81.6 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | -
11 | MuLTI | 80.2 | Yes | MuLTI: Efficient Video-and-Language Understandin... | 2023-03-10 | -
12 | OmniVL | 79.5 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | -
13 | Singularity | 79.4 | Yes | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code
14 | Cap4Video | 79.4 | No | Cap4Video: What Can Auxiliary Captions Do for Te... | 2022-12-31 | Code
15 | DMAE (ViT-B/32) | 79.3 | No | Dual-Modal Attention-Enhanced Text-Video Retriev... | 2023-09-20 | Code
16 | X-CLIP | 79.3 | No | X-CLIP: End-to-End Multi-grained Contrastive Lea... | 2022-07-15 | Code
17 | mPLUG-2 | 79.1 | Yes | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code
18 | STAN | 78.4 | Yes | Revisiting Temporal Modeling for CLIP-based Imag... | 2023-01-26 | Code
19 | HunYuan_tvr | 78.2 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | -
20 | HunYuan_tvr (huge) | 77.8 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | -
21 | Aurora (ours, r=64) | 77.4 | No | - | - | -
22 | Clover | 76.7 | No | Clover: Towards A Unified Video-Language Alignme... | 2022-07-16 | Code
23 | DRL | 76.5 | Yes | Disentangled Representation Learning for Text-Vi... | 2022-03-14 | Code
24 | VIOLETv2 | 76.5 | No | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code
25 | PAU | 76 | No | Prototype-based Aleatoric Uncertainty Quantifica... | 2023-09-29 | Code
26 | DiffusionRet+QB-Norm | 75.5 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code
27 | HBI | 74.9 | No | Video-Text as Game Players: Hierarchical Banzhaf... | 2023-03-25 | Code
28 | DiffusionRet | 74.7 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code
29 | CAMoE | 71.4 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code
30 | QB-Norm+CLIP4Clip | 71.4 | Yes | Cross Modal Retrieval with Querybank Normalisation | 2021-12-23 | Code
31 | CLIP4Clip | 70.2 | Yes | CLIP4Clip: An Empirical Study of CLIP for End to... | 2021-04-18 | Code
32 | ALPRO | 67.5 | Yes | Align and Prompt: Video-and-Language Pre-trainin... | 2021-12-17 | Code
33 | FROZEN | 59.8 | Yes | Frozen in Time: A Joint Video and Image Encoder ... | 2021-04-01 | Code
34 | HD-VILA | 57.4 | No | Advancing High-Resolution Video-Language Represe... | 2021-11-19 | Code
35 | Collaborative Experts | 41.1 | No | Use What You Have: Video Retrieval Using Represe... | 2019-07-31 | Code