A Perceptual Prediction Framework for Self Supervised Event Segmentation

Sathyanarayanan N. Aakur, Sudeep Sarkar

2018-11-12 | CVPR 2019
Tasks: Event Segmentation, Representation Learning, Action Localization, Unsupervised Action Segmentation, Prediction, Action Recognition


Abstract

Temporal segmentation of long videos is an important problem that has largely been tackled through supervised learning, often requiring large amounts of annotated training data. In this paper, we tackle the problem of self-supervised temporal segmentation of long videos, which alleviates the need for any supervision. We introduce a self-supervised, predictive learning framework that draws inspiration from cognitive psychology to segment long, visually complex videos into individual, stable segments that share the same semantics. We also introduce a new adaptive learning paradigm that helps reduce the effect of catastrophic forgetting in recurrent neural networks. Extensive experiments on three publicly available datasets (Breakfast Actions, 50 Salads, and INRIA Instructional Videos) show the efficacy of the proposed approach. We show that the proposed approach outperforms weakly supervised and other unsupervised learning approaches by up to 24% and performs competitively with fully supervised approaches. We also show that the proposed approach learns highly discriminative features that help improve action recognition when used in a representation learning paradigm.
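The abstract does not spell out how a predictive model turns into segment boundaries. One plausible reading, offered here only as a minimal sketch and not as the authors' method, is to mark a boundary wherever the per-frame prediction error spikes above its recent baseline. The function name `detect_boundaries`, the `window` and `k` parameters, and the mean-plus-k-standard-deviations rule are all assumptions made for illustration.

```python
import numpy as np

def detect_boundaries(errors, window=5, k=2.0):
    """Return frame indices whose prediction error spikes above the recent baseline.

    errors: 1-D sequence of per-frame prediction errors (hypothetical input).
    window: number of preceding frames used to estimate the baseline.
    k: number of standard deviations above the baseline mean that counts as a spike.
    """
    errors = np.asarray(errors, dtype=float)
    boundaries = []
    for t in range(window, len(errors)):
        recent = errors[t - window:t]
        # Small epsilon keeps a perfectly flat baseline from triggering on float noise.
        threshold = recent.mean() + k * recent.std() + 1e-8
        if errors[t] > threshold:
            boundaries.append(t)
    return boundaries

# A flat error signal with a single spike at frame 10.
example_errors = [1.0] * 10 + [5.0] + [1.0] * 10
example_boundaries = detect_boundaries(example_errors)
```

On this toy signal only frame 10 is flagged: the frames right after the spike are not, because the spike itself inflates the baseline they are compared against.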

Results

Task	Dataset	Metric	Value	Model
Action Localization	50 Salads	Acc	60.6	LSTM+AL
Action Localization	Youtube INRIA Instructional	F1	39.7	LSTM+AL
Action Localization	Breakfast	Acc	42.9	LSTM+AL
Action Localization	Breakfast	mIoU	46.9	LSTM+AL
Action Segmentation	50 Salads	Acc	60.6	LSTM+AL
Action Segmentation	Youtube INRIA Instructional	F1	39.7	LSTM+AL
Action Segmentation	Breakfast	Acc	42.9	LSTM+AL
Action Segmentation	Breakfast	mIoU	46.9	LSTM+AL
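
The table does not define its metrics. As a rough sketch, assuming "Acc" means frame-wise accuracy and "mIoU" means intersection over union averaged over ground-truth classes, the two can be computed as below. The helper names `frame_accuracy` and `mean_iou` are hypothetical, and the sketch assumes predicted segment labels have already been mapped to ground-truth label ids, a step that unsupervised methods need before scoring.

```python
import numpy as np

def frame_accuracy(pred, gt):
    """Fraction of frames whose predicted label equals the ground-truth label."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float((pred == gt).mean())

def mean_iou(pred, gt):
    """Per-class intersection over union, averaged over classes present in gt."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    ious = []
    for c in np.unique(gt):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))  # never zero: c occurs in gt
        ious.append(inter / union)
    return float(np.mean(ious))

# Six frames, two classes; the prediction switches class one frame too early.
gt_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]
acc = frame_accuracy(pred_labels, gt_labels)
miou = mean_iou(pred_labels, gt_labels)
```

Here five of six frames match, and the class IoUs are 2/3 and 3/4, so a single misplaced boundary costs more in mIoU than in accuracy.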
