Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

End-to-End Dense Video Captioning with Masked Transformer

Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, Caiming Xiong

2018-04-03 · CVPR 2018 · Video Captioning · Dense Video Captioning
Paper · PDF · Code

Abstract

Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e., an event proposal model and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevents direct influence of the language description on the event proposal, which is important for generating accurate descriptions. To address this problem, we propose an end-to-end transformer model for dense video captioning. The encoder encodes the video into appropriate representations. The proposal decoder decodes from the encoding with different anchors to form video event proposals. The captioning decoder employs a masking network to restrict its attention to the proposed event over the encoded features. This masking network converts the event proposal into a differentiable mask, which ensures consistency between proposal and captioning during training. In addition, our model employs a self-attention mechanism, which enables the use of an efficient non-recurrent structure during encoding and leads to performance improvements. We demonstrate the effectiveness of this end-to-end model on the ActivityNet Captions and YouCookII datasets, where we achieve METEOR scores of 10.12 and 6.58, respectively.
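The key idea in the abstract — turning a discrete event proposal into a differentiable mask so the captioning loss can backpropagate into the proposal module — can be illustrated with a minimal sketch. This is an assumption-laden toy version (sigmoid gates over frame indices, hypothetical function name and parameters), not the paper's actual masking network:

```python
import numpy as np

def differentiable_mask(num_frames, center, length, sharpness=10.0):
    """Soft, near-binary mask over frame positions for one event proposal.

    A proposal parameterized by (center, length) becomes a product of two
    shifted sigmoids: one rising at the proposal start, one falling at its
    end. Because the mask is smooth in (center, length), gradients from a
    downstream captioning loss can flow back into the proposal parameters.
    """
    t = np.arange(num_frames, dtype=np.float64)
    start = center - length / 2.0
    end = center + length / 2.0
    # `sharpness` controls how closely the soft mask approximates a
    # hard 0/1 temporal window.
    rise = 1.0 / (1.0 + np.exp(-sharpness * (t - start)))
    fall = 1.0 / (1.0 + np.exp(-sharpness * (end - t)))
    return rise * fall

# Frames inside the proposed event [30, 50) get weights near 1,
# frames far outside get weights near 0.
mask = differentiable_mask(num_frames=100, center=40.0, length=20.0)
```

A captioning decoder could then attend over `features * mask[:, None]`, restricting its attention to the proposed event while keeping the whole pipeline end-to-end trainable.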

Results

| Task             | Dataset  | Metric  | Value | Model |
|------------------|----------|---------|-------|-------|
| Video Captioning | YouCook2 | BLEU-3  | 7.53  | Zhou  |
| Video Captioning | YouCook2 | BLEU-4  | 4.38  | Zhou  |
| Video Captioning | YouCook2 | CIDEr   | 0.38  | Zhou  |
| Video Captioning | YouCook2 | METEOR  | 11.55 | Zhou  |
| Video Captioning | YouCook2 | ROUGE-L | 27.44 | Zhou  |

Related Papers

- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
- Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)
- Dense Video Captioning using Graph-based Sentence Summarization (2025-06-25)
- video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
- VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks (2025-06-10)
- ARGUS: Hallucination and Omission Evaluation in Video-LLMs (2025-06-09)
- Temporal Object Captioning for Street Scene Videos from LiDAR Tracks (2025-05-22)
- FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks (2025-05-19)