Mingfei Han, Linjie Yang, Xiaojun Chang, Heng Wang
A short video clip may contain a progression of multiple events and an interesting storyline. A human needs to capture the event in every shot and associate the events with one another to understand the story behind them. In this work, we present a new multi-shot video understanding benchmark, Shot2Story20K, with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show that generating a long and comprehensive video summary remains challenging. Nevertheless, the generated summaries, though imperfect, can already significantly boost the performance of existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy | 56.8 | SUM-shot+Vicuna |
| Video Captioning | Shot2Story20K | BLEU-4 | 10.7 | Shot2Story |
| Video Captioning | Shot2Story20K | CIDEr | 37.4 | Shot2Story |
| Video Captioning | Shot2Story20K | METEOR | 16.2 | Shot2Story |
| Video Captioning | Shot2Story20K | ROUGE | 29.6 | Shot2Story |
| Video Summarization | Shot2Story20K | BLEU-4 | 11.7 | SUM-shot |
| Video Summarization | Shot2Story20K | CIDEr | 8.6 | SUM-shot |
| Video Summarization | Shot2Story20K | METEOR | 19.7 | SUM-shot |
| Video Summarization | Shot2Story20K | ROUGE | 26.8 | SUM-shot |
| Video Narration Captioning | Shot2Story20K | BLEU-4 | 18.8 | Ours |
| Video Narration Captioning | Shot2Story20K | CIDEr | 168.7 | Ours |
| Video Narration Captioning | Shot2Story20K | METEOR | 24.8 | Ours |
| Video Narration Captioning | Shot2Story20K | ROUGE | 39 | Ours |