Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multimodal Pretraining for Dense Video Captioning

Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, Radu Soricut

2020-11-10 · Asian Chapter of the Association for Computational Linguistics 2020 · Video Captioning · Dense Video Captioning

Paper · PDF · Code (official)

Abstract

Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.
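The model names in the results below (e.g., "MASSvid") point to MASS-style masked sequence-to-sequence pretraining: a contiguous span of the input sequence is masked on the encoder side, and the decoder is trained to reconstruct that span. A minimal sketch of the masking step, assuming whole-word tokens and a single masked span (function and token names are illustrative, not from the paper's code):

```python
def mass_mask(tokens, start, length, mask_token="[MASK]"):
    """Mask a contiguous span of `tokens` for the encoder and return
    the span itself as the decoder's reconstruction target."""
    encoder_input = tokens[:start] + [mask_token] * length + tokens[start + length:]
    decoder_target = tokens[start:start + length]
    return encoder_input, decoder_target

enc, target = mass_mask(["wash", "the", "rice", "thoroughly"], start=1, length=2)
print(enc)     # ['wash', '[MASK]', '[MASK]', 'thoroughly']
print(target)  # ['the', 'rice']
```

In the multimodal setting described in the abstract, the encoder input would additionally carry video features alongside the (partially masked) text; this sketch shows only the text side of the objective.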

Results

Task                    Dataset   Metric   Value  Model
Video Captioning        YouCook2  BLEU-4   12.04  E2vidD6-MASSvid-BiD
Video Captioning        YouCook2  CIDEr     1.22  E2vidD6-MASSvid-BiD
Video Captioning        YouCook2  METEOR   18.32  E2vidD6-MASSvid-BiD
Video Captioning        YouCook2  ROUGE-L  39.03  E2vidD6-MASSvid-BiD
Video Captioning        YouCook2  ROUGE-L  39.03  E2vidD6-MASSalign-BiD
Dense Video Captioning  YouCook2  ROUGE-L  39.03  E2vidD6-MASSalign-BiD
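The ROUGE-L scores above are F-measures over the longest common subsequence (LCS) between a candidate caption and a reference. A minimal token-level sketch (this uses a balanced F1; the official ROUGE script weights recall via a beta parameter, so exact values can differ):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    """Token-level ROUGE-L as a balanced F1 of LCS precision and recall."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)

# Example: one shared word differs, LCS covers 4 of 5 tokens on each side.
score = rouge_l("add oil to the pan".split(), "add oil to a pan".split())
print(round(score, 4))
```

Scores are conventionally reported scaled by 100, matching the 39.03 in the table.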

Related Papers

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)
Dense Video Captioning using Graph-based Sentence Summarization (2025-06-25)
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks (2025-06-10)
ARGUS: Hallucination and Omission Evaluation in Video-LLMs (2025-06-09)
Temporal Object Captioning for Street Scene Videos from LiDAR Tracks (2025-05-22)
FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks (2025-05-19)