Summarizing Videos with Attention

Jiri Fajtl, Hajar Sadeghi Sokeh, Vasileios Argyriou, Dorothy Monekosso, Paolo Remagnino

2018-12-05 | Video Summarization


Abstract

In this work we propose a novel method for supervised, keyshot-based video summarization by applying a conceptually simple and computationally efficient soft self-attention mechanism. Current state-of-the-art methods leverage bi-directional recurrent networks such as BiLSTM combined with attention. These networks are complex to implement and computationally demanding compared to fully connected networks. To that end we propose a simple, self-attention-based network for video summarization which performs the entire sequence-to-sequence transformation in a single feed-forward pass and a single backward pass during training. Our method sets new state-of-the-art results on two benchmarks commonly used in this domain, TvSum and SumMe.
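The idea of scoring every frame in one feed-forward pass with soft self-attention can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the scaled dot-product form, the residual connection, the two-layer regressor, and all names (`soft_self_attention`, `frame_scores`, `w_q`, `w_k`, `w_v`, `w_h`, `w_out`) are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np


def soft_self_attention(x, w_q, w_k, w_v):
    """Soft self-attention over a sequence of frame features.

    x: array of shape (T, D), one feature vector per frame.
    Returns the attended features (T, D_v) and the attention weights (T, T).
    """
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v
    # Scaled dot-product logits (the scaling is an assumption of this sketch).
    logits = q @ k.T / np.sqrt(k.shape[1])
    # Numerically stable row-wise softmax: each frame attends to all frames.
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


def frame_scores(x, params):
    """Map frame features to per-frame importance scores in (0, 1).

    The whole sequence is processed at once, with no recurrence.
    """
    context, _ = soft_self_attention(x, params["w_q"], params["w_k"], params["w_v"])
    # Hypothetical regressor: residual connection, ReLU layer, sigmoid output.
    hidden = np.maximum(0.0, (context + x) @ params["w_h"])
    return 1.0 / (1.0 + np.exp(-(hidden @ params["w_out"])))


# Toy usage with random features standing in for real frame descriptors.
rng = np.random.default_rng(0)
T, D, H = 6, 8, 4
x = rng.normal(size=(T, D))
params = {
    "w_q": 0.1 * rng.normal(size=(D, D)),
    "w_k": 0.1 * rng.normal(size=(D, D)),
    "w_v": 0.1 * rng.normal(size=(D, D)),
    "w_h": 0.1 * rng.normal(size=(D, H)),
    "w_out": 0.1 * rng.normal(size=(H,)),
}
scores = frame_scores(x, params)
```

Because the attention weights are computed with matrix products over the full sequence, there is no step-by-step recurrence as in a BiLSTM, which is the property the abstract highlights.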

Results

Task	Dataset	Metric	Value	Model
Video Summarization	TvSum	F1-score (Augmented)	62.37	VASNet
Video Summarization	TvSum	F1-score (Canonical)	61.42	VASNet
Video Summarization	SumMe	F1-score (Augmented)	51.09	VASNet
Video Summarization	SumMe	F1-score (Canonical)	49.71	VASNet
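For readers unfamiliar with the metric, an F1-score between a predicted and a reference summary can be sketched as below. This is a generic illustration on binary per-frame selections; the function name `summary_f1` is hypothetical, and the details of the Canonical and Augmented protocols (data splits, handling of multiple annotators) are not specified on this page and are not reproduced here.

```python
def summary_f1(predicted, reference):
    """F1-score between two binary per-frame selections of equal length.

    predicted, reference: sequences of 0/1 flags, 1 meaning the frame is
    included in the summary.
    """
    if len(predicted) != len(reference):
        raise ValueError("selections must cover the same number of frames")
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    if overlap == 0:
        # No shared frames: precision and recall are both zero.
        return 0.0
    precision = overlap / sum(1 for p in predicted if p)
    recall = overlap / sum(1 for r in reference if r)
    return 2 * precision * recall / (precision + recall)


# Toy example: 3 frames selected in each summary, 2 of them shared.
predicted = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 1, 1, 0]
f1 = summary_f1(predicted, reference)  # precision = recall = 2/3
```

The values in the table are on a 0 to 100 scale, so a score such as 61.42 corresponds to an F1 of about 0.61 in the 0 to 1 range this function returns.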

Related Papers

TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness (2025-06-25)
MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment (2025-06-12)
Prompts to Summaries: Zero-Shot Language-Guided Video Summarization (2025-06-12)
Enhancing Video Memorability Prediction with Text-Motion Cross-modal Contrastive Loss and Its Application in Video Summarization (2025-06-10)
TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations (2025-06-03)
Unsupervised Transcript-assisted Video Summarization and Highlight Detection (2025-05-29)
REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing (2025-05-24)
SD-VSum: A Method and Dataset for Script-Driven Video Summarization (2025-05-06)