Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Captioning on MSVD

Metric: CIDEr (higher is better)

Results

| # | Model | CIDEr | Extra Data | Paper | Date | Code |
|---|-------|-------|------------|-------|------|------|
| 1 | MaMMUT | 195.6 | No | MaMMUT: A Simple Architecture for Joint Learning... | 2023-03-29 | Code |
| 2 | VLAB | 179.8 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 3 | VALOR | 178.5 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 4 | COSA | 178.5 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 5 | mPLUG-2 | 165.8 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 6 | HowToCaption | 154.2 | No | HowToCaption: Prompting LLMs to Transform Video ... | 2023-10-07 | Code |
| 7 | HiTeA | 146.9 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 8 | Vid2Seq | 146.2 | Yes | Vid2Seq: Large-Scale Pretraining of a Visual Lan... | 2023-02-27 | Code |
| 9 | VIOLETv2 | 139.2 | No | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 10 | RTQ | 123.4 | No | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code |
| 11 | CoCap (ViT/L14) | 121.5 | No | Accurate and Fast Compressed Video Captioning | 2023-09-22 | Code |
| 12 | VASTA (Vatex-backbone) | 119.7 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 13 | IcoCap (ViT-B/16) | 110.3 | Yes | - | - | - |
| 14 | SEM-POS | 108.3 | No | SEM-POS: Grammatically and Semantically Correct ... | 2023-03-26 | - |
| 15 | VASTA (Kinetics-backbone) | 106.4 | No | Diverse Video Captioning by Adaptive Spatio-temp... | 2022-08-19 | Code |
| 16 | IcoCap (ViT-B/32) | 103.8 | Yes | - | - | - |
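
Because CIDEr is a higher-is-better metric, the results above can be ranked and filtered with a few lines of code. The following is a minimal sketch, not part of the site: the `Entry` type and the `rank` and `best` helpers are hypothetical names introduced for illustration, and only the first six rows are copied in by hand.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    cider: float
    extra_data: bool


# The first six rows of the results above, copied by hand.
ENTRIES = [
    Entry("MaMMUT", 195.6, False),
    Entry("VLAB", 179.8, True),
    Entry("VALOR", 178.5, True),
    Entry("COSA", 178.5, True),
    Entry("mPLUG-2", 165.8, False),
    Entry("HowToCaption", 154.2, False),
]


def rank(entries: List[Entry]) -> List[Tuple[int, Entry]]:
    """Return (position, entry) pairs sorted by CIDEr, highest first.

    Python's sort is stable, so tied scores (VALOR and COSA here)
    keep their input order.
    """
    ordered = sorted(entries, key=lambda e: e.cider, reverse=True)
    return list(enumerate(ordered, start=1))


def best(entries: List[Entry], extra_data: bool) -> Entry:
    """Return the highest-CIDEr entry with the given extra-data flag."""
    candidates = [e for e in entries if e.extra_data == extra_data]
    return max(candidates, key=lambda e: e.cider)


if __name__ == "__main__":
    for position, entry in rank(ENTRIES):
        print(position, entry.model, entry.cider)
    print("Best without extra data:", best(ENTRIES, extra_data=False).model)
    print("Best with extra data:", best(ENTRIES, extra_data=True).model)
```

Splitting by the "Extra Data" flag is one way to compare like with like, since models trained with additional data are not directly comparable to those without.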