Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Captioning on MSR-VTT

Metric: CIDEr (higher is better)

Results

| # | Model | CIDEr | Extra Data | Paper | Date | Code |
|---|-------|-------|------------|-------|------|------|
| 1 | mPLUG-2 | 80 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 2 | VAST | 78 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 3 | GIT2 | 75.9 | Yes | GIT: A Generative Image-to-text Transformer for ... | 2022-05-27 | Code |
| 4 | VLAB | 74.9 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 5 | COSA | 74.7 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 6 | VALOR | 74 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 7 | MaMMUT (ours) | 73.6 | No | MaMMUT: A Simple Architecture for Joint Learning... | 2023-03-29 | Code |
| 8 | VideoCoCa | 73.2 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 9 | RTQ | 69.3 | Yes | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code |
| 10 | HowToCaption | 65.3 | No | HowToCaption: Prompting LLMs to Transform Video ... | 2023-10-07 | Code |
| 11 | HiTeA | 65.1 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 12 | Vid2Seq | 64.6 | Yes | Vid2Seq: Large-Scale Pretraining of a Visual Lan... | 2023-02-27 | Code |
| 13 | TextKG | 60.8 | No | Text with Knowledge Graph Augmented Transformer ... | 2023-03-22 | - |
| 14 | IcoCap (ViT-B/16) | 60.2 | Yes | - | - | - |
| 15 | MV-GPT | 60 | Yes | End-to-end Generative Pretraining for Multimodal... | 2022-01-20 | - |
| 16 | IcoCap (ViT-B/32) | 59.1 | Yes | - | - | - |
| 17 | CLIP-DCD | 58.7 | No | CLIP Meets Video Captioning: Concept-Aware Repre... | 2021-11-30 | Code |
| 18 | VIOLETv2 | 58 | No | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 19 | CoCap (ViT/L14) | 57.2 | No | Accurate and Fast Compressed Video Captioning | 2023-09-22 | Code |
| 20 | VASTA (Vatex-backbone) | 56.08 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 21 | VASTA (Kinetics-backbone) | 55 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 22 | EMCL-Net | 54.6 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 23 | SEM-POS | 53.1 | No | SEM-POS: Grammatically and Semantically Correct ... | 2023-03-26 | - |
| 24 | UniVL + MELTR | 52.77 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |