Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Retrieval on MSR-VTT

Metric: text-to-video R@5 (higher is better)
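
The page does not spell out how R@5 is computed. As commonly defined, text-to-video Recall@K is the percentage of text queries whose correct video appears among the K highest-scoring candidates. The following is a minimal sketch under stated assumptions, not the evaluation code used by any listed paper: the function name `recall_at_k`, the square similarity-matrix layout, and the one-correct-video-per-query convention are all illustrative assumptions.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of text queries whose matching video ranks in the top k.

    Assumptions (hypothetical layout, not taken from the leaderboard):
    similarity[i][j] scores text query i against video j, and video i is
    the single correct match for query i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        target = row[i]
        # Zero-based rank: how many videos score strictly higher than the
        # correct one. Ties are resolved in favour of the correct video.
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 example: queries 0 and 2 rank their video first, query 1 ranks it last.
toy = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.5],
    [0.4, 0.6, 0.7],
]
r_at_1 = recall_at_k(toy, 1)
r_at_3 = recall_at_k(toy, 3)
```

Under this reading, a leaderboard value of 84.3 would mean that roughly 84 of every 100 text queries retrieve their video within the top five results. Published numbers also depend on the test split and any post-processing each paper applies, which this sketch does not model.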


Results

| # | Model | text-to-video R@5 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | VAST | 84.3 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 2 | VALOR | 83.5 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 3 | UMT-L (ViT-L/16) | 81 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 4 | vid-TLDR (UMT-L) | 81 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 5 | VLAB | 78.8 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 6 | TEFAL | 76.6 | No | Audio-Enhanced Text-to-Video Retrieval using Tex... | 2023-07-24 | - |
| 7 | All-in-one + MELTR | 74.4 | Yes | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 8 | OmniVL | 74.2 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 9 | Aurora (ours, r=64) | 73.9 | No | - | - | - |
| 10 | UCoFiA | 72.1 | No | Unified Coarse-to-Fine Alignment for Video-Text ... | 2023-09-18 | Code |
| 11 | CLIP4Clip-seqTransf | 71.4 | No | CLIP4Clip: An Empirical Study of CLIP for End to... | 2021-04-18 | Code |
| 12 | HD-VILA | 65.3 | No | Advancing High-Resolution Video-Language Represe... | 2021-11-19 | Code |
| 13 | VIOLETv2 | 64.8 | Yes | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 14 | VIOLET + MELTR | 63.7 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 15 | FROZEN | 61.5 | No | Frozen in Time: A Joint Video and Image Encoder ... | 2021-04-01 | Code |
| 16 | COTS | 60.8 | No | COTS: Collaborative Two-Stream Vision-Language P... | 2022-04-15 | - |
| 17 | MDMMT-2 | 60.5 | Yes | MDMMT-2: Multidomain Multimodal Transformer for ... | 2022-03-14 | - |
| 18 | CLIP2TV | 58.9 | Yes | CLIP2TV: Align, Match and Distill for Video-Text... | 2021-11-10 | - |
| 19 | CAMoE | 58.3 | Yes | Improving Video-Text Retrieval by Multi-Stream C... | 2021-09-09 | Code |
| 20 | VideoCoCa (zero-shot) | 57.8 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 21 | Ours | 56.7 | No | Video and Text Matching with Conditioned Embeddi... | 2021-10-21 | Code |
| 22 | CLIP2Video | 55.5 | Yes | CLIP2Video: Mastering Video-Text Retrieval via I... | 2021-06-21 | Code |
| 23 | UniVL + MELTR | 55.5 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 24 | LAFF | 54.9 | No | Lightweight Attentional Feature Fusion: A New Ba... | 2021-12-03 | Code |
| 25 | CoCa (zero-shot) | 52.4 | Yes | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 26 | TACo | 52.1 | Yes | TACo: Token-aware Cascade Contrastive Learning f... | 2021-08-23 | - |
| 27 | MDMMT | 49.8 | Yes | MDMMT: Multidomain Multimodal Transformer for Vi... | 2021-03-19 | Code |
| 28 | UniVL | 49.6 | Yes | UniVL: A Unified Video and Language Pre-Training... | 2020-02-15 | Code |
| 29 | CLIP | 41.1 | No | A Straightforward Framework For Video Retrieval ... | 2021-02-24 | Code |
| 30 | RoME | 29.6 | No | RoME: Role-aware Mixture-of-Expert Transformer f... | 2022-06-26 | Code |
| 31 | Collaborative Experts | 29 | No | Use What You Have: Video Retrieval Using Represe... | 2019-07-31 | Code |
| 32 | JEMC | 20.9 | No | - | - | Code |