Unsupervised Discriminative Embedding for Sub-Action Learning in Complex Activities

Sirnam Swetha, Hilde Kuehne, Yogesh S Rawat, Mubarak Shah

2021-04-30Action Segmentation Action Recognition

Abstract

Action recognition and detection in the context of long untrimmed video sequences has seen an increased attention from the research community. However, annotation of complex activities is usually time consuming and challenging in practice. Therefore, recent works started to tackle the problem of unsupervised learning of sub-actions in complex activities. This paper proposes a novel approach for unsupervised sub-action learning in complex activities. The proposed method maps both visual and temporal representations to a latent space where the sub-actions are learnt discriminatively in an end-to-end fashion. To this end, we propose to learn sub-actions as latent concepts and a novel discriminative latent concept learning (DLCL) module aids in learning sub-actions. The proposed DLCL module lends on the idea of latent concepts to learn compact representations in the latent embedding space in an unsupervised way. The result is a set of latent vectors that can be interpreted as cluster centers in the embedding space. The latent space itself is formed by a joint visual and temporal embedding capturing the visual similarity and temporal ordering of the data. Our joint learning with discriminative latent concept module is novel which eliminates the need for explicit clustering. We validate our approach on three benchmark datasets and show that the proposed combination of visual-temporal embedding and discriminative latent concepts allow to learn robust action representations in an unsupervised setting.

Results

Task	Dataset	Metric	Value	Model
Action Localization	Breakfast	Acc	47.4	UDE
Action Localization	Breakfast	F1@50%	31.9	UDE
Action Segmentation	Breakfast	Acc	47.4	UDE
Action Segmentation	Breakfast	F1@50%	31.9	UDE

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains2025-07-17 Self-supervised pretraining of vision transformers for animal behavioral analysis and neural encoding2025-07-13 Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment2025-07-01 EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception2025-06-26 Feature Hallucination for Self-supervised Action Recognition2025-06-25 CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition2025-06-25 Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition2025-06-23 Adapting Vision-Language Models for Evaluating World Models2025-06-22