Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

Mingfei Han, Linjie Yang, Xiaojun Chang, Heng Wang

2023-12-16

Zero-Shot Video Question Answer, Video Retrieval, Video Question Answering, Video Summarization, Video Captioning, video narration captioning, Video Understanding


Abstract

A short video clip may contain a progression of multiple events and an interesting story line. A human needs to capture the event in every shot and associate the events with one another to understand the story behind them. In this work, we present a new multi-shot video understanding benchmark, Shot2Story20K, with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show some challenges in generating a long and comprehensive video summary. Nevertheless, the generated imperfect summaries can already significantly boost the performance of existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.

Results

Task	Dataset	Metric	Value	Model
Video Question Answering	MSRVTT-QA	Accuracy	56.8	SUM-shot+Vicuna
Video Captioning	Shot2Story20K	BLEU-4	10.7	Shot2Story
Video Captioning	Shot2Story20K	CIDEr	37.4	Shot2Story
Video Captioning	Shot2Story20K	METEOR	16.2	Shot2Story
Video Captioning	Shot2Story20K	ROUGE	29.6	Shot2Story
Video Summarization	Shot2Story20K	BLEU-4	11.7	SUM-shot
Video Summarization	Shot2Story20K	CIDEr	8.6	SUM-shot
Video Summarization	Shot2Story20K	METEOR	19.7	SUM-shot
Video Summarization	Shot2Story20K	ROUGE	26.8	SUM-shot
video narration captioning	Shot2Story20K	BLEU-4	18.8	Ours
video narration captioning	Shot2Story20K	CIDEr	168.7	Ours
video narration captioning	Shot2Story20K	METEOR	24.8	Ours
video narration captioning	Shot2Story20K	ROUGE	39	Ours
