Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Retrieval on MSR-VTT-1kA

Metric: text-to-video R@10 (higher is better)
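
For readers unfamiliar with the metric, the following is a minimal sketch of how text-to-video R@K (recall at K) is commonly computed. It assumes a square similarity matrix in which row i holds the scores of text query i against every candidate video and the correct video for query i sits at index i. The function name `recall_at_k` and the toy scores are illustrative only and are not taken from any listed paper; individual submissions may handle ties or multiple captions per video differently.

```python
def recall_at_k(sim, k):
    """Percentage of text queries whose correct video ranks in the top k.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the correct video = number of videos scored strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy example with 3 queries and 3 videos (made-up scores).
toy_sim = [
    [0.9, 0.1, 0.2],  # correct video ranked 1st
    [0.8, 0.3, 0.5],  # correct video ranked 3rd
    [0.1, 0.2, 0.7],  # correct video ranked 1st
]
print(recall_at_k(toy_sim, 1))
print(recall_at_k(toy_sim, 3))
```

With K = 10, a score of 90.8 would mean that for 90.8% of the text queries the correct video appears among the ten highest-scoring candidates.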

Results

| # | Model | text-to-video R@10 | Extra Data | Paper | Date | Code |
|---|-------|--------------------|------------|-------|------|------|
| 1 | HunYuan_tvr (huge) | 90.8 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
| 2 | OmniVec | 89.4 | Yes | OmniVec: Learning robust representations with cr... | 2023-11-07 | - |
| 3 | CLIP-ViP | 88.2 | Yes | CLIP-ViP: Adapting Pre-trained Image-Text Model ... | 2022-09-14 | Code |
| 4 | STAN | 87.8 | Yes | Revisiting Temporal Modeling for CLIP-based Imag... | 2023-01-26 | Code |
| 5 | PIDRo | 87.6 | No | - | - | - |
| 6 | DRL | 87.6 | Yes | Disentangled Representation Learning for Text-Vi... | 2022-03-14 | Code |
| 7 | TS2-Net | 87.4 | No | TS2-Net: Token Shift and Selection Transformer f... | 2022-07-16 | Code |
| 8 | DMAE (ViT-B/16) | 87.1 | No | Dual-Modal Attention-Enhanced Text-Video Retriev... | 2023-09-20 | Code |
| 9 | EERCF | 86.9 | No | Towards Efficient and Effective Text-to-Video Re... | 2024-01-01 | Code |
| 10 | CLIP2TV | 86.5 | Yes | CLIP2TV: Align, Match and Distill for Video-Text... | 2021-11-10 | - |
| 11 | MuLTI | 86 | No | MuLTI: Efficient Video-and-Language Understandin... | 2023-03-10 | - |
| 12 | EMCL-Net++ | 85.3 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 13 | CAMoE | 85.3 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code |
| 14 | X-CLIP | 84.8 | No | X-CLIP: End-to-End Multi-grained Contrastive Lea... | 2022-07-15 | Code |
| 15 | mPLUG-2 | 84.7 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 16 | RTQ | 84.4 | No | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code |
| 17 | Side4Video | 84.2 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 18 | X2-VLM (large) | 84.2 | No | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 19 | X2-VLM (base) | 84.2 | No | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 20 | Cap4Video | 83.9 | No | Cap4Video: What Can Auxiliary Captions Do for Te... | 2022-12-31 | Code |
| 21 | SuMA (ViT-B/16) | 83.9 | No | Video-Text Retrieval by Supervised Sparse Multi-... | 2023-02-19 | Code |
| 22 | UCoFiA | 83.5 | No | Unified Coarse-to-Fine Alignment for Video-Text ... | 2023-09-18 | Code |
| 23 | TeachCLIP (ViT-B/16) | 83.5 | No | - | - | Code |
| 24 | HBI | 83.4 | No | Video-Text as Game Players: Hierarchical Banzhaf... | 2023-03-25 | Code |
| 25 | DiffusionRet+QB-Norm | 83.1 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 26 | EMCL-Net | 83.1 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 27 | QB-Norm+CLIP2Video | 83 | Yes | Cross Modal Retrieval with Querybank Normalisation | 2021-12-23 | Code |
| 28 | DiffusionRet | 82.7 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 29 | TeachCLIP | 82.6 | No | - | - | Code |
| 30 | PAU | 82.5 | No | Prototype-based Aleatoric Uncertainty Quantifica... | 2023-09-29 | Code |
| 31 | All-in-one + MELTR | 82.5 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 32 | X-Pool | 82.2 | Yes | X-Pool: Cross-Modal Language-Video Attention for... | 2022-03-28 | Code |
| 33 | CenterCLIP (ViT-B/16) | 82 | Yes | CenterCLIP: Token Clustering for Efficient Text-... | 2022-05-02 | Code |
| 34 | LAFF | 82 | No | Lightweight Attentional Feature Fusion: A New Ba... | 2021-12-03 | Code |
| 35 | HiTeA | 81.9 | No | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 36 | CLIP2Video | 81.7 | Yes | CLIP2Video: Mastering Video-Text Retrieval via I... | 2021-06-21 | Code |
| 37 | CLIP4Clip | 81.6 | Yes | CLIP4Clip: An Empirical Study of CLIP for End to... | 2021-04-18 | Code |
| 38 | VindLU | 80.4 | Yes | VindLU: A Recipe for Effective Video-and-Languag... | 2022-12-09 | Code |
| 39 | MDMMT | 79.7 | Yes | MDMMT: Multidomain Multimodal Transformer for Vi... | 2021-03-19 | Code |
| 40 | Clover | 79.4 | No | Clover: Towards A Unified Video-Language Alignme... | 2022-07-16 | Code |
| 41 | OmniVec (pretrained) | 78.6 | Yes | OmniVec: Learning robust representations with cr... | 2023-11-07 | - |
| 42 | VIOLET + MELTR | 78.4 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 43 | All-in-one-B | 77.1 | Yes | All in One: Exploring Unified Video-Language Pre... | 2022-03-14 | Code |
| 44 | Singularity | 77 | Yes | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 45 | BridgeFormer | 75.1 | Yes | Bridging Video-text Retrieval with Multiple Choi... | 2022-01-13 | Code |
| 46 | MAC | 73.9 | Yes | Masked Contrastive Pre-Training for Efficient Vi... | 2022-12-02 | - |
| 47 | COTS | 73.2 | Yes | COTS: Collaborative Two-Stream Vision-Language P... | 2022-04-15 | - |
| 48 | Florence | 72.6 | Yes | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 49 | TACo | 71.2 | No | TACo: Token-aware Cascade Contrastive Learning f... | 2021-08-23 | - |
| 50 | FROZEN | 70.5 | Yes | Frozen in Time: A Joint Video and Image Encoder ... | 2021-04-01 | Code |
| 51 | MMT-Pretrained | 69.6 | Yes | Multi-modal Transformer for Video Retrieval | 2020-07-21 | Code |
| 52 | UniVL + MELTR | 68.3 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 53 | VLM | 67.4 | Yes | VLM: Task-agnostic Video-Language Model Pre-trai... | 2021-05-20 | Code |
| 54 | MMT | 67.1 | No | Multi-modal Transformer for Video Retrieval | 2020-07-21 | Code |
| 55 | VideoCLIP | 66.8 | Yes | VideoCLIP: Contrastive Pre-training for Zero-sho... | 2021-09-28 | Code |
| 56 | CLIP | 64.2 | Yes | A Straightforward Framework For Video Retrieval ... | 2021-02-24 | Code |
| 57 | Collaborative Experts | 62.4 | Yes | Use What You Have: Video Retrieval Using Represe... | 2019-07-31 | Code |
| 58 | BridgeFormer (Zero-shot) | 56.4 | No | Bridging Video-text Retrieval with Multiple Choi... | 2022-01-13 | Code |
| 59 | HT-Pretrained | 52.8 | No | HowTo100M: Learning a Text-Video Embedding by Wa... | 2019-06-07 | Code |
| 60 | HT | 48 | No | HowTo100M: Learning a Text-Video Embedding by Wa... | 2019-06-07 | Code |
| 61 | JSFusion | 43.2 | No | A Joint Sequence Fusion Model for Video Question... | 2018-08-07 | Code |