Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

Mingfei Han, Linjie Yang, Xiaojun Chang, Heng Wang

2023-12-16

Tasks: Zero-Shot Video Question Answer · Video Retrieval · Video Question Answering · Video Summarization · Video Captioning · video narration captioning · Video Understanding

Paper · PDF · Code (official)

Abstract

A short video clip may contain the progression of multiple events and an interesting story line. A human needs to capture both the events in every shot and associate them together to understand the story behind the clip. In this work, we present Shot2Story20K, a new multi-shot video understanding benchmark with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show that generating a long and comprehensive video summary remains challenging. Nevertheless, even these imperfect generated summaries can already significantly boost the performance of existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy | 56.8 | SUM-shot+Vicuna |
| Video Captioning | Shot2Story20K | BLEU-4 | 10.7 | Shot2Story |
| Video Captioning | Shot2Story20K | CIDEr | 37.4 | Shot2Story |
| Video Captioning | Shot2Story20K | METEOR | 16.2 | Shot2Story |
| Video Captioning | Shot2Story20K | ROUGE | 29.6 | Shot2Story |
| Video Summarization | Shot2Story20K | BLEU-4 | 11.7 | SUM-shot |
| Video Summarization | Shot2Story20K | CIDEr | 8.6 | SUM-shot |
| Video Summarization | Shot2Story20K | METEOR | 19.7 | SUM-shot |
| Video Summarization | Shot2Story20K | ROUGE | 26.8 | SUM-shot |
| video narration captioning | Shot2Story20K | BLEU-4 | 18.8 | Ours |
| video narration captioning | Shot2Story20K | CIDEr | 168.7 | Ours |
| video narration captioning | Shot2Story20K | METEOR | 24.8 | Ours |
| video narration captioning | Shot2Story20K | ROUGE | 39 | Ours |
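The caption and summary scores above are standard n-gram overlap metrics. As a rough illustration only (not the paper's evaluation code, which would typically use an established toolkit), sentence-level BLEU-4 can be sketched in plain Python; the smoothing and example sentences here are assumptions for demonstration:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(reference, hypothesis):
    """Sentence-level BLEU-4 with add-one smoothing on the n-gram
    precisions (a simplification; real evaluations use corpus-level
    BLEU and more careful smoothing)."""
    precisions = []
    for n in range(1, 5):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped overlap: each hypothesis n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(1, sum(hyp_counts.values()))
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / 4)
    # Brevity penalty discourages overly short hypotheses.
    bp = min(1.0, math.exp(1 - len(reference) / max(1, len(hypothesis))))
    return bp * geo_mean

# Invented example captions, not taken from Shot2Story20K.
ref = "a man opens a door and walks inside".split()
hyp = "a man opens the door and walks in".split()
print(f"BLEU-4: {bleu4(ref, hyp):.3f}")
```

CIDEr, METEOR, and ROUGE follow the same reference-versus-hypothesis pattern but weight n-grams differently (e.g. CIDEr uses TF-IDF weighting across the corpus, which is why its scale differs so much between tasks in the table above).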

Related Papers

- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
- Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
- Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
- Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
- MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)
- Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models (2025-07-08)