Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video on MSR-VTT-1kA

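Metric: text-to-video Median Rank (lower is better). The results table is sorted from the highest (worst) to the lowest (best) median rank, so the # column gives list position, not quality.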
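
As a rough illustration of how the median rank metric is usually computed for text-to-video retrieval, here is a minimal sketch in plain Python. It assumes a square similarity matrix in which row i holds the scores of text query i against every candidate video, and in which video i is the ground-truth match for query i. The function names, the toy scores, and the optimistic tie-breaking rule are assumptions made for illustration; they are not taken from any of the listed papers.

```python
from statistics import median


def text_to_video_ranks(similarity):
    """Return the 1-based rank of the ground-truth video for each text query.

    Assumption: similarity[i][j] is the score of query i against video j,
    and video i is the correct match for query i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the
        # ground-truth video (ties are resolved optimistically here).
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def median_rank(similarity):
    """Median of the ground-truth ranks over all queries (lower is better)."""
    return median(text_to_video_ranks(similarity))


# Hypothetical toy scores for 4 queries and 4 videos.
toy = [
    [0.9, 0.1, 0.2, 0.3],  # correct video ranked 1st
    [0.8, 0.5, 0.1, 0.2],  # correct video ranked 2nd
    [0.7, 0.6, 0.5, 0.1],  # correct video ranked 3rd
    [0.1, 0.2, 0.3, 0.9],  # correct video ranked 1st
]

print(text_to_video_ranks(toy))  # [1, 2, 3, 1]
print(median_rank(toy))          # 1.5
```

A rank of 1 means the correct video was retrieved first, so smaller median ranks indicate better retrieval. On a test split with many queries the median is typically a small whole number, which is why many entries on a leaderboard can share the same value.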

Results

| # | Model | text-to-video Median Rank | Extra Data | Paper | Date | Code |
|---|-------|---------------------------|------------|-------|------|------|
| 1 | JSFusion | 13 | No | A Joint Sequence Fusion Model for Video Question... | 2018-08-07 | Code |
| 2 | HT | 12 | No | HowTo100M: Learning a Text-Video Embedding by Wa... | 2019-06-07 | Code |
| 3 | HT-Pretrained | 9 | No | HowTo100M: Learning a Text-Video Embedding by Wa... | 2019-06-07 | Code |
| 4 | BridgeFormer (Zero-shot) | 7 | No | Bridging Video-text Retrieval with Multiple Choi... | 2022-01-13 | Code |
| 5 | Collaborative Experts | 6 | Yes | Use What You Have: Video Retrieval Using Represe... | 2019-07-31 | Code |
| 6 | CLIP | 4 | Yes | A Straightforward Framework For Video Retrieval ... | 2021-02-24 | Code |
| 7 | UniVL + MELTR | 4 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 8 | TACo | 4 | No | TACo: Token-aware Cascade Contrastive Learning f... | 2021-08-23 | - |
| 9 | VLM | 4 | Yes | VLM: Task-agnostic Video-Language Model Pre-trai... | 2021-05-20 | Code |
| 10 | MMT-Pretrained | 4 | Yes | Multi-modal Transformer for Video Retrieval | 2020-07-21 | Code |
| 11 | MMT | 4 | No | Multi-modal Transformer for Video Retrieval | 2020-07-21 | Code |
| 12 | MAC | 3 | Yes | Masked Contrastive Pre-Training for Efficient Vi... | 2022-12-02 | - |
| 13 | BridgeFormer | 3 | Yes | Bridging Video-text Retrieval with Multiple Choi... | 2022-01-13 | Code |
| 14 | VIOLET + MELTR | 3 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 15 | FROZEN | 3 | Yes | Frozen in Time: A Joint Video and Image Encoder ... | 2021-04-01 | Code |
| 16 | X-CLIP | 2 | No | X-CLIP: End-to-End Multi-grained Contrastive Lea... | 2022-07-15 | Code |
| 17 | DiffusionRet | 2 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 18 | DiffusionRet+QB-Norm | 2 | No | DiffusionRet: Generative Text-Video Retrieval wi... | 2023-03-17 | Code |
| 19 | CAMoE | 2 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code |
| 20 | HBI | 2 | No | Video-Text as Game Players: Hierarchical Banzhaf... | 2023-03-25 | Code |
| 21 | PAU | 2 | No | Prototype-based Aleatoric Uncertainty Quantifica... | 2023-09-29 | Code |
| 22 | CenterCLIP (ViT-B/16) | 2 | Yes | CenterCLIP: Token Clustering for Efficient Text-... | 2022-05-02 | Code |
| 23 | QB-Norm+CLIP2Video | 2 | Yes | Cross Modal Retrieval with Querybank Normalisation | 2021-12-23 | Code |
| 24 | X-Pool | 2 | Yes | X-Pool: Cross-Modal Language-Video Attention for... | 2022-03-28 | Code |
| 25 | CLIP2Video | 2 | Yes | CLIP2Video: Mastering Video-Text Retrieval via I... | 2021-06-21 | Code |
| 26 | Clover | 2 | No | Clover: Towards A Unified Video-Language Alignme... | 2022-07-16 | Code |
| 27 | MDMMT | 2 | Yes | MDMMT: Multidomain Multimodal Transformer for Vi... | 2021-03-19 | Code |
| 28 | COTS | 2 | Yes | COTS: Collaborative Two-Stream Vision-Language P... | 2022-04-15 | - |
| 29 | CLIP4Clip | 2 | Yes | CLIP4Clip: An Empirical Study of CLIP for End to... | 2021-04-18 | Code |
| 30 | HunYuan_tvr (huge) | 1 | Yes | Tencent Text-Video Retrieval: Hierarchical Cross... | 2022-04-07 | - |
| 31 | CLIP-ViP | 1 | Yes | CLIP-ViP: Adapting Pre-trained Image-Text Model ... | 2022-09-14 | Code |
| 32 | PIDRo | 1 | No | - | - | - |
| 33 | DMAE (ViT-B/16) | 1 | No | Dual-Modal Attention-Enhanced Text-Video Retriev... | 2023-09-20 | Code |
| 34 | STAN | 1 | Yes | Revisiting Temporal Modeling for CLIP-based Imag... | 2023-01-26 | Code |
| 35 | DRL | 1 | Yes | Disentangled Representation Learning for Text-Vi... | 2022-03-14 | Code |
| 36 | CLIP2TV | 1 | Yes | CLIP2TV: Align, Match and Distill for Video-Text... | 2021-11-10 | - |
| 37 | Side4Video | 1 | No | Side4Video: Spatial-Temporal Side Network for Me... | 2023-11-27 | Code |
| 38 | Cap4Video | 1 | No | Cap4Video: What Can Auxiliary Captions Do for Te... | 2022-12-31 | Code |