Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Progressive Video Summarization via Multimodal Self-supervised Learning

Li Haopeng, Ke Qiuhong, Gong Mingming, Tom Drummond

2022-01-07 | Self-Supervised Learning | Supervised Video Summarization | Video Summarization | Video Classification

Abstract

Modern video summarization methods are based on deep neural networks that require a large amount of annotated data for training. However, existing datasets for video summarization are small-scale, which easily leads to over-fitting of the deep models. Considering that the annotation of large-scale datasets is time-consuming, we propose a multimodal self-supervised learning framework to obtain semantic representations of videos, which benefits the video summarization task. Specifically, the self-supervised learning is conducted by exploring the semantic consistency between the videos and text in both coarse-grained and fine-grained fashions, as well as by recovering masked frames in the videos. The multimodal framework is trained on a newly collected dataset that consists of video-text pairs. Additionally, we introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries. Extensive experiments demonstrate the effectiveness and superiority of our method in terms of rank correlation coefficients and F-score.
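The F-score named in the abstract is usually read as the harmonic mean of precision and recall over the frames shared by a predicted summary and a reference summary. The sketch below illustrates that reading under stated assumptions: summaries are binary per-frame selection vectors of equal length, and the function name `summary_f_score` is hypothetical. It is not the authors' evaluation code, and benchmark pages typically report the value as a percentage rather than a fraction.

```python
def summary_f_score(predicted, reference):
    """F-score between two binary frame-selection vectors of equal length.

    Assumption: 1 marks a frame included in the summary, 0 a frame left out.
    """
    if len(predicted) != len(reference):
        raise ValueError("summaries must cover the same number of frames")

    # Frames selected by both the prediction and the reference.
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    n_pred = sum(1 for p in predicted if p)
    n_ref = sum(1 for r in reference if r)

    # No overlap (or an empty summary) gives an F-score of zero.
    if overlap == 0 or n_pred == 0 or n_ref == 0:
        return 0.0

    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


# Toy example: 8 frames, 3 selected by each summary, 2 in common.
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
reference = [1, 0, 0, 0, 1, 1, 0, 0]
score = summary_f_score(predicted, reference)
print(f"F-score: {score:.4f}")
```

With several reference summaries per video, the per-reference scores have to be aggregated (for example by averaging or taking the maximum); which rule applies depends on the dataset protocol and is not stated here.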

Results

Task                 Dataset  Metric                  Value  Model
Video                TvSum    F1-score (Canonical)    60.4   SSPVS(+Text)
Video                TvSum    Kendall's Tau           0.181  SSPVS(+Text)
Video                TvSum    Spearman's Rho          0.238  SSPVS(+Text)
Video                TvSum    F1-score (Augmented)    61.8   SSPVS
Video                TvSum    F1-score (Canonical)    60.3   SSPVS
Video                TvSum    Kendall's Tau           0.177  SSPVS
Video                TvSum    Spearman's Rho          0.233  SSPVS
Video                SumMe    F1-score (Canonical)    50.7   SSPVS(+Text)
Video                SumMe    Kendall's Tau           0.192  SSPVS(+Text)
Video                SumMe    Spearman's Rho          0.257  SSPVS(+Text)
Video                SumMe    F1-score (Augmented)    50.4   SSPVS
Video                SumMe    F1-score (Canonical)    48.7   SSPVS
Video                SumMe    Kendall's Tau           0.178  SSPVS
Video                SumMe    Spearman's Rho          0.24   SSPVS
Video Summarization  TvSum    F1-score (Canonical)    60.4   SSPVS(+Text)
Video Summarization  TvSum    Kendall's Tau           0.181  SSPVS(+Text)
Video Summarization  TvSum    Spearman's Rho          0.238  SSPVS(+Text)
Video Summarization  TvSum    F1-score (Augmented)    61.8   SSPVS
Video Summarization  TvSum    F1-score (Canonical)    60.3   SSPVS
Video Summarization  TvSum    Kendall's Tau           0.177  SSPVS
Video Summarization  TvSum    Spearman's Rho          0.233  SSPVS
Video Summarization  SumMe    F1-score (Canonical)    50.7   SSPVS(+Text)
Video Summarization  SumMe    Kendall's Tau           0.192  SSPVS(+Text)
Video Summarization  SumMe    Spearman's Rho          0.257  SSPVS(+Text)
Video Summarization  SumMe    F1-score (Augmented)    50.4   SSPVS
Video Summarization  SumMe    F1-score (Canonical)    48.7   SSPVS
Video Summarization  SumMe    Kendall's Tau           0.178  SSPVS
Video Summarization  SumMe    Spearman's Rho          0.24   SSPVS
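
Kendall's Tau and Spearman's Rho in the results measure how well a ranking of frames by predicted importance agrees with a ranking by human-annotated importance. The sketch below shows one way to compute both in plain Python. It assumes per-frame scores as equal-length lists, uses the tau-b variant of Kendall's coefficient to handle ties, and all function names and toy scores are hypothetical. It is an illustration of the metrics, not the evaluation code behind the numbers above.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-b over all pairs of items (O(n^2) for clarity)."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: excluded from both denominator terms
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = ((untied + ties_x) * (untied + ties_y)) ** 0.5
    return 0.0 if denom == 0 else (concordant - discordant) / denom


# Toy per-frame importance scores (hypothetical values).
predicted_scores = [0.9, 0.1, 0.5, 0.7, 0.3]
human_scores = [0.8, 0.2, 0.4, 0.9, 0.1]
rho = spearman_rho(predicted_scores, human_scores)
tau = kendall_tau(predicted_scores, human_scores)
print(f"Spearman's rho: {rho:.3f}, Kendall's tau: {tau:.3f}")
```

Both coefficients lie in [-1, 1], with 0 meaning no rank agreement, which is why the reported values (around 0.18 to 0.26) sit on a different scale from the F1-scores.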

Related Papers

A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder (2025-07-14)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model (2025-07-01)
ShapeEmbed: a self-supervised learning framework for 2D contour quantification (2025-07-01)
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
RetFiner: A Vision-Language Refinement Scheme for Retinal Foundation Models (2025-06-27)
Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features (2025-06-26)