Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


End-to-End Dense Video Captioning with Parallel Decoding

Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, Ping Luo

2021-08-17 · ICCV 2021 · Caption Generation · Video Captioning · Dense Video Captioning
Paper · PDF · Code (official)

Abstract

Dense video captioning aims to generate multiple associated captions with their temporal locations from a video. Previous methods follow a sophisticated "localize-then-describe" scheme, which relies heavily on numerous hand-crafted components. In this paper, we propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC), by formulating dense caption generation as a set prediction task. In practice, by stacking a newly proposed event counter on top of a transformer decoder, PDVC precisely segments the video into a number of event pieces under a holistic understanding of the video content, which effectively increases the coherence and readability of the predicted captions. Compared with prior art, PDVC has several appealing advantages: (1) Without relying on heuristic non-maximum suppression or a recurrent event sequence selection network to remove redundancy, PDVC directly produces an event set of appropriate size; (2) In contrast to the two-stage scheme, we feed the enhanced representations of event queries into the localization head and the caption head in parallel, making these two sub-tasks deeply interrelated and mutually promoting during optimization; (3) Without bells and whistles, extensive experiments on ActivityNet Captions and YouCook2 show that PDVC produces high-quality captioning results, surpassing state-of-the-art two-stage methods when its localization accuracy is on par with them. Code is available at https://github.com/ttengwang/PDVC.
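The core idea of the abstract — learnable event queries refined by a transformer decoder, then fed to a localization head, a caption head, and an event counter *in parallel* rather than localize-then-describe — can be sketched in PyTorch. This is a hypothetical simplification, not the official PDVC implementation: the module name, head designs (a plain linear captioner instead of the paper's LSTM/deformable captioner), and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ParallelDecodingSketch(nn.Module):
    """Hypothetical sketch of PDVC-style parallel decoding.

    N learnable event queries cross-attend to frame features in a
    transformer decoder; the refined query embeddings go to three
    parallel heads: localization, captioning, and an event counter
    that predicts how many queries correspond to real events.
    """

    def __init__(self, d_model=256, num_queries=10, vocab_size=1000,
                 max_events=10, max_caption_len=20):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        # Localization head: (center, length) per query, normalized to [0, 1].
        self.loc_head = nn.Linear(d_model, 2)
        # Caption head: word logits per query (stand-in for the paper's
        # recurrent captioner).
        self.cap_head = nn.Linear(d_model, max_caption_len * vocab_size)
        # Event counter: a distribution over 0..max_events events,
        # predicted from a pooled summary of all refined queries.
        self.counter = nn.Linear(d_model, max_events + 1)
        self.max_caption_len = max_caption_len
        self.vocab_size = vocab_size

    def forward(self, frame_feats):
        # frame_feats: (batch, T, d_model) pre-extracted video features.
        b = frame_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        h = self.decoder(q, frame_feats)          # refined event queries
        segments = self.loc_head(h).sigmoid()     # (b, N, 2)
        captions = self.cap_head(h).view(
            b, -1, self.max_caption_len, self.vocab_size)
        count = self.counter(h.max(dim=1).values)  # (b, max_events + 1)
        return segments, captions, count


model = ParallelDecodingSketch()
feats = torch.randn(2, 50, 256)  # 2 videos, 50 frames each
segments, captions, count = model(feats)
```

Note how the segment and caption predictions both come from the same refined query embeddings `h` in a single forward pass, which is what ties the two sub-tasks together during optimization in place of a two-stage pipeline.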

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Video Captioning | YouCook2 | BLEU-4 | 0.8 | PDVC (TSN features, no SCST)
Video Captioning | YouCook2 | CIDEr | 22.71 | PDVC (TSN features, no SCST)
Video Captioning | YouCook2 | METEOR | 4.74 | PDVC (TSN features, no SCST)
Video Captioning | YouCook2 | SODA | 4.42 | PDVC (TSN features, no SCST)
Video Captioning | ActivityNet Captions | BLEU-4 | 2.17 | PDVC (TSP features, no SCST)
Video Captioning | ActivityNet Captions | CIDEr | 31.14 | PDVC (TSP features, no SCST)
Video Captioning | ActivityNet Captions | METEOR | 9.03 | PDVC (TSP features, no SCST)
Video Captioning | ActivityNet Captions | SODA | 6.05 | PDVC (TSP features, no SCST)
Dense Video Captioning | YouCook2 | BLEU-4 | 0.8 | PDVC (TSN features, no SCST)
Dense Video Captioning | YouCook2 | CIDEr | 22.71 | PDVC (TSN features, no SCST)
Dense Video Captioning | YouCook2 | METEOR | 4.74 | PDVC (TSN features, no SCST)
Dense Video Captioning | YouCook2 | SODA | 4.42 | PDVC (TSN features, no SCST)
Dense Video Captioning | ActivityNet Captions | BLEU-4 | 2.17 | PDVC (TSP features, no SCST)
Dense Video Captioning | ActivityNet Captions | CIDEr | 31.14 | PDVC (TSP features, no SCST)
Dense Video Captioning | ActivityNet Captions | METEOR | 9.03 | PDVC (TSP features, no SCST)
Dense Video Captioning | ActivityNet Captions | SODA | 6.05 | PDVC (TSP features, no SCST)

Related Papers

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning (2025-07-09)
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World (2025-06-30)
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)
Dense Video Captioning using Graph-based Sentence Summarization (2025-06-25)
SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning (2025-06-18)
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits (2025-06-11)