Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Captioning on YouCook2

Metric: CIDEr (higher is better)

Results

# | Model | CIDEr | Extra Data | Paper | Date | Code
--- | --- | --- | --- | --- | --- | ---
1 | HowToCaption | 116.4 | No | HowToCaption: Prompting LLMs to Transform Video ... | 2023-10-07 | Code
2 | HiCM² | 71.84 | Yes | HiCM²: Hierarchical Compact Memory Modeling f... | 2024-12-19 | Code
3 | Vid2Seq (HowTo100M+VidChapters-7M PT) | 67.2 | Yes | - | - | -
4 | Vid2Seq | 47.1 | Yes | Vid2Seq: Large-Scale Pretraining of a Visual Lan... | 2023-02-27 | Code
5 | CM² | 31.66 | No | Do You Remember? Dense Video Captioning with Cro... | 2024-04-11 | Code
6 | GVL | 26.52 | No | Learning Grounded Vision-Language Representation... | 2023-03-11 | Code
7 | PDVC (TSN features, no SCST) | 22.71 | No | End-to-End Dense Video Captioning with Parallel ... | 2021-08-17 | Code
8 | Vid2Seq (HowTo100M+VidChapters-7M PT) | 13.3 | Yes | - | - | -
9 | VAST | 1.99 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code
10 | UniVL + MELTR | 1.9 | No | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code
11 | UniVL | 1.81 | Yes | UniVL: A Unified Video and Language Pre-Training... | 2020-02-15 | Code
12 | VLM | 1.3869 | Yes | VLM: Task-agnostic Video-Language Model Pre-trai... | 2021-05-20 | Code
13 | TextKG | 1.33 | No | Text with Knowledge Graph Augmented Transformer ... | 2023-03-22 | -
14 | COSA | 1.31 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code
15 | MA-LMM | 1.31 | No | MA-LMM: Memory-Augmented Large Multimodal Model ... | 2024-04-08 | Code
16 | VideoCoCa | 1.28 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | -
17 | E2vidD6-MASSvid-BiD | 1.22 | Yes | Multimodal Pretraining for Dense Video Captioning | 2020-11-10 | Code
18 | OmniVL | 1.16 | No | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | -
19 | COOT | 0.57 | Yes | COOT: Cooperative Hierarchical Transformer for V... | 2020-11-01 | Code
20 | VideoBERT + S3D | 0.55 | No | VideoBERT: A Joint Model for Video and Language ... | 2019-04-03 | Code
21 | Zhou | 0.38 | No | End-to-End Dense Video Captioning with Masked Tr... | 2018-04-03 | Code