Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Efficient Movie Scene Detection using State-Space Transformers

Md Mohaiminul Islam, Mahmudul Hasan, Kishan Shamsundar Athrey, Tony Braskich, Gedas Bertasius

2022-12-29 · CVPR 2023 · Video Recognition · Scene Segmentation · Video Classification
Paper · PDF · Code (official)

Abstract

The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block is used to aggregate long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking the S4A blocks one after the other multiple times. Our proposed TranS4mer outperforms all prior methods in three movie scene detection datasets, including MovieNet, BBC, and OVSD, while also being $2\times$ faster and requiring $3\times$ less GPU memory than standard Transformer models. We will release our code and models.
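The two-stage structure of the S4A block (intra-shot self-attention followed by an inter-shot state-space pass, wrapped in a residual connection) can be sketched in a toy NumPy form. This is not the authors' released code: the real TranS4mer uses learned multi-head attention and structured S4 kernels, whereas here `A`, `B`, `C` are hand-set diagonal state-space parameters and the attention is single-head and unparameterized.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Unparameterized scaled dot-product self-attention over one shot.
    x: (shot_len, d) -> (shot_len, d)."""
    d = x.shape[-1]
    scores = softmax(x @ x.T / np.sqrt(d))
    return scores @ x

def ssm_scan(x, A, B, C):
    """Diagonal linear state-space recurrence over the full sequence
    (stand-in for an S4 layer). x: (T, d); A, B, C: (d,)."""
    h = np.zeros(x.shape[-1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = A * h + B * x[t]      # state update
        out[t] = C * h            # readout
    return out

def s4a_block(frames, shot_len, A, B, C):
    """Toy S4A block: frames is (num_shots * shot_len, d), i.e. the
    frame sequence already divided into equal-length shots."""
    T, d = frames.shape
    shots = frames.reshape(-1, shot_len, d)
    # 1) self-attention within each shot captures short-range cues
    attended = np.stack([self_attention(s) for s in shots])
    # 2) state-space scan across the whole sequence aggregates
    #    long-range inter-shot dependencies
    y = ssm_scan(attended.reshape(T, d), A, B, C)
    return frames + y             # residual connection
```

Stacking this block several times, as the abstract describes, yields the end-to-end model; the linear recurrence is what keeps the long-range pass cheap relative to full self-attention over all frames.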

Results

Task | Dataset | Metric | Value | Model
Scene Segmentation | MovieNet | AP | 60.78 | TranS4mer
Video Classification | Breakfast | Accuracy (%) | 90.27 | TranS4mer
Video Classification | COIN | Accuracy (%) | 89.3 | TranS4mer

Related Papers

DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
ThermalDiffusion: Visual-to-Thermal Image-to-Image Translation for Autonomous Navigation (2025-06-26)
Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation (2025-06-14)
Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis (2025-05-31)
Spatiotemporal Analysis of Forest Machine Operations Using 3D Video Classification (2025-05-30)
Video-GPT via Next Clip Diffusion (2025-05-18)
JointDistill: Adaptive Multi-Task Distillation for Joint Depth Estimation and Scene Segmentation (2025-05-15)