Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Video on MSR-VTT-1kA

Metric: text-to-video R@5 (higher is better)
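As a rough illustration of the metric, the sketch below computes Recall@K the way it is commonly defined for text-to-video retrieval: the percentage of text queries whose matching video appears among the top K ranked videos. The page itself does not spell out the definition, so this is an assumption, as are the function name `recall_at_k`, the score-matrix layout, and the convention that query i matches video i. The scores are made-up toy values, not leaderboard data.

```python
def recall_at_k(similarity, k=5):
    """Percentage of text queries whose matching video ranks in the top k.

    similarity[i][j] is a hypothetical score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        target = row[i]
        # Zero-based rank: number of videos scored strictly higher than the match.
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 example with made-up scores.
toy_scores = [
    [0.9, 0.1, 0.2],  # query 0: matching video ranked 1st
    [0.8, 0.3, 0.7],  # query 1: matching video ranked 3rd
    [0.2, 0.6, 0.5],  # query 2: matching video ranked 2nd
]
r_at_1 = recall_at_k(toy_scores, k=1)
r_at_2 = recall_at_k(toy_scores, k=2)
```

Ties are resolved optimistically here (only strictly higher scores push the match down); individual papers may handle ties differently.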


Results

| # | Model | text-to-video R@5 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | HunYuan_tvr (huge) | 84.5 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
| 2 | CLIP-ViP | 80.5 | Yes | CLIP-ViP: Adapting Pre-trained Image-Text Model ... | 2022-09-14 | Code |
| 3 | DRL | 80.3 | Yes | Disentangled Representation Learning for Text-Vi... | 2022-03-14 | Code |
| 4 | PIDRo | 79.8 | No | - | - | - |
| 5 | STAN | 79.5 | Yes | Revisiting Temporal Modeling for CLIP-based Imag... | 2023-01-26 | Code |
| 6 | DMAE (ViT-B/16) | 79.4 | No | Dual-Modal Attention-Enhanced Text-Video Retriev... | 2023-09-20 | Code |
| 7 | TS2-Net | 79.3 | No | TS2-Net: Token Shift and Selection Transformer f... | 2022-07-16 | Code |
| 8 | EERCF | 78.8 | No | Towards Efficient and Effective Text-to-Video Re... | 2024-01-01 | Code |
| 9 | CLIP2TV | 78.5 | Yes | CLIP2TV: Align, Match and Distill for Video-Text... | 2021-11-10 | - |
| 10 | EMCL-Net++ | 78.1 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 11 | MuLTI | 77.7 | No | MuLTI: Efficient Video-and-Language Understandin... | 2023-03-10 | - |
| 12 | mPLUG-2 | 77.6 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 13 | X2-VLM (large) | 76.7 | No | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 14 | RTQ | 76.1 | No | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code |
| 15 | TeachCLIP (ViT-B/16) | 75.9 | No | - | - | Code |
| 16 | X-CLIP | 75.8 | No | X-CLIP: End-to-End Multi-grained Contrastive Lea... | 2022-07-15 | Code |
| 17 | Cap4Video | 75.7 | No | Cap4Video: What Can Auxiliary Captions Do for Te... | 2022-12-31 | Code |
| 18 | CAMoE | 75.6 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code |
| 19 | Side4Video | 75.5 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 20 | DiffusionRet | 75.2 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 21 | DiffusionRet+QB-Norm | 75.2 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 22 | SuMA (ViT-B/16) | 75.1 | No | Video-Text Retrieval by Supervised Sparse Multi-... | 2023-02-19 | Code |
| 23 | HBI | 74.6 | No | Video-Text as Game Players: Hierarchical Banzhaf... | 2023-03-25 | Code |
| 24 | TeachCLIP | 74.3 | No | - | - | Code |
| 25 | X2-VLM (base) | 74.1 | No | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 26 | CenterCLIP (ViT-B/16) | 73.8 | Yes | CenterCLIP: Token Clustering for Efficient Text-... | 2022-05-02 | Code |
| 27 | All-in-one + MELTR | 73.5 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 28 | EMCL-Net | 73.1 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 29 | QB-Norm+CLIP2Video | 73 | Yes | Cross Modal Retrieval with Querybank Normalisation | 2021-12-23 | Code |
| 30 | X-Pool | 72.8 | Yes | X-Pool: Cross-Modal Language-Video Attention for... | 2022-03-28 | Code |
| 31 | PAU | 72.7 | No | Prototype-based Aleatoric Uncertainty Quantifica... | 2023-09-29 | Code |
| 32 | CLIP2Video | 72.6 | Yes | CLIP2Video: Mastering Video-Text Retrieval via I... | 2021-06-21 | Code |
| 33 | UCoFiA | 72.1 | No | Unified Coarse-to-Fine Alignment for Video-Text ... | 2023-09-18 | Code |
| 34 | VindLU | 71.5 | Yes | VindLU: A Recipe for Effective Video-and-Languag... | 2022-12-09 | Code |
| 35 | LAFF | 71.5 | No | Lightweight Attentional Feature Fusion: A New Ba... | 2021-12-03 | Code |
| 36 | HiTeA | 71.2 | No | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 37 | Clover | 69.8 | No | Clover: Towards A Unified Video-Language Alignme... | 2022-07-16 | Code |
| 38 | MDMMT | 69 | Yes | MDMMT: Multidomain Multimodal Transformer for Vi... | 2021-03-19 | Code |
| 39 | Singularity | 68.7 | Yes | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 40 | All-in-one-B | 68.1 | Yes | All in One: Exploring Unified Video-Language Pre... | 2022-03-14 | Code |
| 41 | VIOLET + MELTR | 67.2 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 42 | BridgeFormer | 64.8 | Yes | Bridging Video-text Retrieval with Multiple Choi... | 2022-01-13 | Code |
| 43 | Florence | 63.8 | Yes | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 44 | COTS | 63.8 | Yes | COTS: Collaborative Two-Stream Vision-Language P... | 2022-04-15 | - |
| 45 | MAC | 63.1 | Yes | Masked Contrastive Pre-Training for Efficient Vi... | 2022-12-02 | - |
| 46 | FROZEN | 59.5 | Yes | Frozen in Time: A Joint Video and Image Encoder ... | 2021-04-01 | Code |
| 47 | TACo | 57.8 | No | TACo: Token-aware Cascade Contrastive Learning f... | 2021-08-23 | - |
| 48 | MMT-Pretrained | 57.1 | Yes | Multi-modal Transformer for Video Retrieval | 2020-07-21 | Code |
| 49 | UniVL + MELTR | 55.7 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 50 | VLM | 55.5 | Yes | VLM: Task-agnostic Video-Language Model Pre-trai... | 2021-05-20 | Code |
| 51 | VideoCLIP | 55.4 | Yes | VideoCLIP: Contrastive Pre-training for Zero-sho... | 2021-09-28 | Code |
| 52 | MMT | 54 | No | Multi-modal Transformer for Video Retrieval | 2020-07-21 | Code |
| 53 | CLIP | 53.7 | Yes | A Straightforward Framework For Video Retrieval ... | 2021-02-24 | Code |
| 54 | Collaborative Experts | 48.8 | Yes | Use What You Have: Video Retrieval Using Represe... | 2019-07-31 | Code |
| 55 | BridgeFormer (Zero-shot) | 46.4 | No | Bridging Video-text Retrieval with Multiple Choi... | 2022-01-13 | Code |
| 56 | HT-Pretrained | 40.2 | No | HowTo100M: Learning a Text-Video Embedding by Wa... | 2019-06-07 | Code |
| 57 | HT | 35 | No | HowTo100M: Learning a Text-Video Embedding by Wa... | 2019-06-07 | Code |
| 58 | JSFusion | 31.2 | No | A Joint Sequence Fusion Model for Video Question... | 2018-08-07 | Code |