Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video on MSR-VTT-1kA

Metric: text-to-video R@1 (higher is better)
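For readers unfamiliar with the metric, text-to-video R@1 (recall at rank 1) is the percentage of text queries for which the correct video is ranked first among all candidates. The following is a minimal sketch of that computation, not the evaluation code used by any entry on this leaderboard. The function name `recall_at_k`, the toy score matrix, and the convention that query `i` matches video `i` are all assumptions made for illustration.

```python
def recall_at_k(similarity, k=1):
    """Return the percentage of text queries whose correct video is in the top k.

    Assumptions (illustrative, not taken from the leaderboard):
    - similarity[i][j] is the score of text query i against video j
    - the correct video for query i sits at index i
    - ties are resolved in favour of the correct video
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Hypothetical 3-query, 3-video score matrix.
toy_scores = [
    [0.9, 0.1, 0.3],  # query 0: correct video 0 is ranked first
    [0.2, 0.4, 0.8],  # query 1: correct video 1 is ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video 2 is ranked first
]

r1 = recall_at_k(toy_scores, k=1)  # 2 of 3 queries hit at rank 1
r2 = recall_at_k(toy_scores, k=2)  # all 3 queries hit within the top 2
print(f"R@1 = {r1:.1f}, R@2 = {r2:.1f}")
```

Under this definition a higher value is better, which matches the sort order of the results table.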


Results

| # | Model | text-to-video R@1 | Extra Data | Paper | Date | Code |
|---|-------|-------------------|------------|-------|------|------|
| 1 | HunYuan_tvr (huge) | 62.9 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
| 2 | CLIP-ViP | 57.7 | Yes | CLIP-ViP: Adapting Pre-trained Image-Text Model ... | 2022-09-14 | Code |
| 3 | PIDRo | 55.9 | No | - | - | - |
| 4 | DMAE (ViT-B/16) | 55.5 | No | Dual-Modal Attention-Enhanced Text-Video Retriev... | 2023-09-20 | Code |
| 5 | HunYuan_tvr | 55 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
| 6 | MuLTI | 54.7 | No | MuLTI: Efficient Video-and-Language Understandin... | 2023-03-10 | - |
| 7 | STAN | 54.1 | Yes | Revisiting Temporal Modeling for CLIP-based Imag... | 2023-01-26 | Code |
| 8 | EERCF | 54.1 | No | Towards Efficient and Effective Text-to-Video Re... | 2024-01-01 | Code |
| 9 | TS2-Net | 54 | No | TS2-Net: Token Shift and Selection Transformer f... | 2022-07-16 | Code |
| 10 | RTQ | 53.4 | No | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code |
| 11 | DRL | 53.3 | Yes | Disentangled Representation Learning for Text-Vi... | 2022-03-14 | Code |
| 12 | mPLUG-2 | 53.1 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 13 | CLIP2TV | 52.9 | Yes | CLIP2TV: Align, Match and Distill for Video-Text... | 2021-11-10 | - |
| 14 | Side4Video | 52.3 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 15 | EMCL-Net++ | 51.6 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 16 | Cap4Video | 51.4 | No | Cap4Video: What Can Auxiliary Captions Do for Te... | 2022-12-31 | Code |
| 17 | SuMA (ViT-B/16) | 49.8 | No | Video-Text Retrieval by Supervised Sparse Multi-... | 2023-02-19 | Code |
| 18 | X2-VLM (large) | 49.6 | No | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 19 | UCoFiA | 49.4 | No | Unified Coarse-to-Fine Alignment for Video-Text ... | 2023-09-18 | Code |
| 20 | X-CLIP | 49.3 | No | X-CLIP: End-to-End Multi-grained Contrastive Lea... | 2022-07-15 | Code |
| 21 | DiffusionRet | 49 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 22 | DiffusionRet+QB-Norm | 48.9 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 23 | CAMoE | 48.8 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code |
| 24 | HBI | 48.6 | No | Video-Text as Game Players: Hierarchical Banzhaf... | 2023-03-25 | Code |
| 25 | PAU | 48.5 | No | Prototype-based Aleatoric Uncertainty Quantifica... | 2023-09-29 | Code |
| 26 | CenterCLIP (ViT-B/16) | 48.4 | Yes | CenterCLIP: Token Clustering for Efficient Text-... | 2022-05-02 | Code |
| 27 | TeachCLIP (ViT-B/16) | 48 | No | - | - | Code |
| 28 | X2-VLM (base) | 47.6 | No | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 29 | QB-Norm+CLIP2Video | 47.2 | Yes | Cross Modal Retrieval with Querybank Normalisation | 2021-12-23 | Code |
| 30 | X-Pool | 46.9 | Yes | X-Pool: Cross-Modal Language-Video Attention for... | 2022-03-28 | Code |
| 31 | TeachCLIP | 46.8 | No | - | - | Code |
| 32 | EMCL-Net | 46.8 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 33 | HiTeA | 46.8 | No | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 34 | VindLU | 46.5 | Yes | VindLU: A Recipe for Effective Video-and-Languag... | 2022-12-09 | Code |
| 35 | LAFF | 45.8 | No | Lightweight Attentional Feature Fusion: A New Ba... | 2021-12-03 | Code |
| 36 | CLIP2Video | 45.6 | Yes | CLIP2Video: Mastering Video-Text Retrieval via I... | 2021-06-21 | Code |
| 37 | Singularity | 41.5 | Yes | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 38 | All-in-one + MELTR | 41.3 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 39 | Clover | 40.5 | No | Clover: Towards A Unified Video-Language Alignme... | 2022-07-16 | Code |
| 40 | MDMMT | 38.9 | Yes | MDMMT: Multidomain Multimodal Transformer for Vi... | 2021-03-19 | Code |
| 41 | MAC | 38.9 | Yes | Masked Contrastive Pre-Training for Efficient Vi... | 2022-12-02 | - |
| 42 | All-in-one-B | 37.9 | Yes | All in One: Exploring Unified Video-Language Pre... | 2022-03-14 | Code |
| 43 | BridgeFormer | 37.6 | Yes | Bridging Video-text Retrieval with Multiple Choi... | 2022-01-13 | Code |
| 44 | Florence | 37.6 | Yes | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 45 | COTS | 36.8 | Yes | COTS: Collaborative Two-Stream Vision-Language P... | 2022-04-15 | - |
| 46 | VIOLET + MELTR | 35.5 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 47 | CLIP | 31.2 | Yes | A Straightforward Framework For Video Retrieval ... | 2021-02-24 | Code |
| 48 | UniVL + MELTR | 31.1 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 49 | FROZEN | 31 | Yes | Frozen in Time: A Joint Video and Image Encoder ... | 2021-04-01 | Code |
| 50 | VideoCLIP | 30.9 | Yes | VideoCLIP: Contrastive Pre-training for Zero-sho... | 2021-09-28 | Code |
| 51 | TACo | 28.4 | No | TACo: Token-aware Cascade Contrastive Learning f... | 2021-08-23 | - |
| 52 | VLM | 28.1 | Yes | VLM: Task-agnostic Video-Language Model Pre-trai... | 2021-05-20 | Code |
| 53 | MMT-Pretrained | 26.6 | Yes | Multi-modal Transformer for Video Retrieval | 2020-07-21 | Code |
| 54 | BridgeFormer (Zero-shot) | 26 | No | Bridging Video-text Retrieval with Multiple Choi... | 2022-01-13 | Code |
| 55 | MMT | 24.6 | No | Multi-modal Transformer for Video Retrieval | 2020-07-21 | Code |
| 56 | Collaborative Experts | 20.9 | Yes | Use What You Have: Video Retrieval Using Represe... | 2019-07-31 | Code |
| 57 | HT-Pretrained | 14.9 | No | HowTo100M: Learning a Text-Video Embedding by Wa... | 2019-06-07 | Code |
| 58 | HT | 12.1 | No | HowTo100M: Learning a Text-Video Embedding by Wa... | 2019-06-07 | Code |
| 59 | JSFusion | 10.2 | No | A Joint Sequence Fusion Model for Video Question... | 2018-08-07 | Code |