Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Retrieval on MSR-VTT

Metric: text-to-video R@10 (higher is better)
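For readers unfamiliar with the metric, the sketch below shows one common way to compute text-to-video Recall@K from a matrix of similarity scores. It is a minimal illustration, not the evaluation code behind any entry on this leaderboard. The function name `recall_at_k`, the convention that video `i` is the single correct match for text query `i`, and the toy scores are all assumptions made for the example. Results are returned as a percentage, which appears to match the scale of the scores listed on this page.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 10) -> float:
    """Percentage of text queries whose matching video ranks in the top k.

    Assumes similarity[i][j] scores text query i against video j, and that
    video i is the single correct match for query i (an illustrative
    convention, not something stated on this page).
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: the number of videos that
        # score strictly higher. Ties are therefore resolved optimistically.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example with hypothetical scores: 3 text queries, 3 videos.
toy = [
    [0.9, 0.1, 0.2],  # correct video (index 0) is ranked 1st
    [0.8, 0.3, 0.5],  # correct video (index 1) is ranked 3rd
    [0.2, 0.7, 0.6],  # correct video (index 2) is ranked 2nd
]
print(recall_at_k(toy, k=1))
print(recall_at_k(toy, k=2))
```

On the toy matrix, one of three queries retrieves its video at rank 1 and two of three within the top 2, so the value rises as K grows. Real evaluations may differ in tie handling and in how multiple captions per video are treated.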

Results

| # | Model | text-to-video R@10 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | VAST | 89.6 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 2 | VALOR | 89.6 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 3 | GRAM | 89.3 | Yes | Gramian Multimodal Representation Learning and A... | 2024-12-16 | Code |
| 4 | VLAB | 87.6 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 5 | UMT-L (ViT-L/16) | 87.1 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 6 | TEFAL | 86.1 | No | Audio-Enhanced Text-to-Video Retrieval using Tex... | 2023-07-24 | - |
| 7 | All-in-one + MELTR | 84.7 | Yes | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 8 | OmniVL | 83.8 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 9 | UCoFiA | 83.5 | No | Unified Coarse-to-Fine Alignment for Video-Text ... | 2023-09-18 | Code |
| 10 | Aurora (ours, r=64) | 82 | No | - | - | - |
| 11 | vid-TLDR (UMT-L) | 81.6 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 12 | CLIP4Clip-seqTransf | 81.6 | No | CLIP4Clip: An Empirical Study of CLIP for End to... | 2021-04-18 | Code |
| 13 | HD-VILA | 78 | No | Advancing High-Resolution Video-Language Represe... | 2021-11-19 | Code |
| 14 | VIOLET + MELTR | 77.8 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 15 | VIOLETv2 | 75.8 | Yes | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 16 | FROZEN | 71.2 | No | Frozen in Time: A Joint Video and Image Encoder ... | 2021-04-01 | Code |
| 17 | MDMMT-2 | 70.8 | Yes | MDMMT-2: Multidomain Multimodal Transformer for ... | 2022-03-14 | - |
| 18 | COTS | 70.2 | No | COTS: Collaborative Two-Stream Vision-Language P... | 2022-04-15 | - |
| 19 | CLIP2TV | 68.9 | Yes | CLIP2TV: Align, Match and Distill for Video-Text... | 2021-11-10 | - |
| 20 | CAMoE | 68.4 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code |
| 21 | UniVL + MELTR | 67.6 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 22 | VideoCoCa (zero-shot) | 67 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 23 | CLIP2Video | 66.2 | Yes | CLIP2Video: Mastering Video-Text Retrieval via I... | 2021-06-21 | Code |
| 24 | LAFF | 65.8 | No | Lightweight Attentional Feature Fusion: A New Ba... | 2021-12-03 | Code |
| 25 | TACo | 64 | Yes | TACo: Token-aware Cascade Contrastive Learning f... | 2021-08-23 | - |
| 26 | UniVL | 63.1 | Yes | UniVL: A Unified Video and Language Pre-Training... | 2020-02-15 | Code |
| 27 | MDMMT | 61.8 | Yes | MDMMT: Multidomain Multimodal Transformer for Vi... | 2021-03-19 | Code |
| 28 | CoCa (zero-shot) | 61.6 | Yes | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 29 | Text-Video Embedding | 52.8 | No | HowTo100M: Learning a Text-Video Embedding by Wa... | 2019-06-07 | Code |
| 30 | CLIP | 50.4 | No | A Straightforward Framework For Video Retrieval ... | 2021-02-24 | Code |
| 31 | JSFusion | 43.2 | No | A Joint Sequence Fusion Model for Video Question... | 2018-08-07 | Code |
| 32 | RoME | 41.2 | No | RoME: Role-aware Mixture-of-Expert Transformer f... | 2022-06-26 | Code |
| 33 | Collaborative Experts | 41.2 | No | Use What You Have: Video Retrieval Using Represe... | 2019-07-31 | Code |
| 34 | JEMC | 29.7 | No | - | - | Code |
| 35 | Kaufman | 24.1 | No | Temporal Tessellation: A Unified Approach for Vi... | 2016-12-21 | Code |
| 36 | C+LSTM+SA+FC7 | 19.9 | No | Learning Language-Visual Embedding for Movie Und... | 2016-09-26 | - |