Li Haopeng, Ke Qiuhong, Gong Mingming, Tom Drummond
Modern video summarization methods are based on deep neural networks that require a large amount of annotated data for training. However, existing datasets for video summarization are small-scale, which easily leads to overfitting of the deep models. Considering that the annotation of large-scale datasets is time-consuming, we propose a multimodal self-supervised learning framework to obtain semantic representations of videos, which benefits the video summarization task. Specifically, the self-supervised learning is conducted by exploring the semantic consistency between videos and text in both coarse-grained and fine-grained fashions, as well as by recovering masked frames in the videos. The multimodal framework is trained on a newly collected dataset that consists of video-text pairs. Additionally, we introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries. Extensive experiments demonstrate the effectiveness and superiority of our method in terms of rank correlation coefficients and F-score.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Summarization | TvSum | F1-score (Canonical) | 60.4 | SSPVS(+Text) |
| Video Summarization | TvSum | Kendall's Tau | 0.181 | SSPVS(+Text) |
| Video Summarization | TvSum | Spearman's Rho | 0.238 | SSPVS(+Text) |
| Video Summarization | TvSum | F1-score (Augmented) | 61.8 | SSPVS |
| Video Summarization | TvSum | F1-score (Canonical) | 60.3 | SSPVS |
| Video Summarization | TvSum | Kendall's Tau | 0.177 | SSPVS |
| Video Summarization | TvSum | Spearman's Rho | 0.233 | SSPVS |
| Video Summarization | SumMe | F1-score (Canonical) | 50.7 | SSPVS(+Text) |
| Video Summarization | SumMe | Kendall's Tau | 0.192 | SSPVS(+Text) |
| Video Summarization | SumMe | Spearman's Rho | 0.257 | SSPVS(+Text) |
| Video Summarization | SumMe | F1-score (Augmented) | 50.4 | SSPVS |
| Video Summarization | SumMe | F1-score (Canonical) | 48.7 | SSPVS |
| Video Summarization | SumMe | Kendall's Tau | 0.178 | SSPVS |
| Video Summarization | SumMe | Spearman's Rho | 0.240 | SSPVS |
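
The metrics in the table can be illustrated with a small sketch. The code below is a minimal pure-Python illustration, not the authors' evaluation code: it computes Kendall's Tau (the tau-a variant, which ignores ties) and Spearman's Rho between predicted and annotated frame-importance scores, and an F1-score between two binary frame selections. The function names and the toy scores are hypothetical, and the sketch does not reproduce the benchmark protocol (for example, how multiple human annotations per video are aggregated, or how the Canonical and Augmented settings differ). The table reports F1 as a percentage, whereas this sketch returns a fraction.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs; tied pairs count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def f1_score(pred, ref):
    """F1 between two binary frame selections (1 = frame is in the summary)."""
    overlap = sum(p and r for p, r in zip(pred, ref))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred)
    recall = overlap / sum(ref)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical scores for five frames).
predicted = [0.1, 0.4, 0.3, 0.8, 0.6]
annotated = [0.2, 0.3, 0.5, 0.9, 0.4]
tau = kendall_tau(predicted, annotated)
rho = spearman_rho(predicted, annotated)
f1 = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Rank correlation compares the ordering of the predicted importance scores with the ordering given by human annotators, so it does not depend on how frames are later selected into a summary. The F1-score instead measures the overlap between the generated summary and a reference summary.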