Unsupervised Motion Representation Learning with Capsule Autoencoders

Ziwei Xu, Xudong Shen, Yongkang Wong, Mohan S Kankanhalli

2021-10-01
NeurIPS 2021 12
Unsupervised Skeleton Based Action Recognition, Self-Supervised Human Action Recognition, Representation Learning, Skeleton Based Action Recognition, Action Recognition


Abstract

We propose the Motion Capsule Autoencoder (MCAE), which addresses a key challenge in the unsupervised learning of motion representations: transformation invariance. MCAE models motion in a two-level hierarchy. In the lower level, a spatio-temporal motion signal is divided into short, local, and semantic-agnostic snippets. In the higher level, the snippets are aggregated to form full-length semantic-aware segments. For both levels, we represent motion with a set of learned transformation-invariant templates and the corresponding geometric transformations by using capsule autoencoders of a novel design. This leads to a robust and efficient encoding of viewpoint changes. MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets. Notably, it achieves better results than baselines on Trajectory20 with considerably fewer parameters and state-of-the-art performance on the unsupervised skeleton-based action recognition task.
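
The lower-level idea in the abstract (short snippets, each described by a template plus a geometric transformation) can be illustrated with a small NumPy sketch. This is not the MCAE implementation: the function names, the hand-written template, the non-overlapping split, and the choice of a 2D similarity transform are all assumptions made for illustration, whereas in the paper the templates and transformations are learned by capsule autoencoders.

```python
import numpy as np


def split_into_snippets(traj, snippet_len):
    """Split a (T, 2) trajectory into non-overlapping (snippet_len, 2) snippets.

    Trailing points that do not fill a whole snippet are dropped.
    """
    n = traj.shape[0] // snippet_len
    return traj[: n * snippet_len].reshape(n, snippet_len, 2)


def transform_template(template, angle, scale, shift):
    """Apply a 2D similarity transform (rotate, scale, translate) to a template."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    return scale * template @ rot.T + shift


# Hypothetical template: a short straight stroke along the x-axis.
template = np.stack([np.linspace(0.0, 1.0, 4), np.zeros(4)], axis=1)

# A toy trajectory built from the same template under two transformations:
# first unrotated at the origin, then rotated by 90 degrees and shifted.
traj = np.concatenate([
    transform_template(template, 0.0, 1.0, np.array([0.0, 0.0])),
    transform_template(template, np.pi / 2, 1.0, np.array([1.0, 0.0])),
])

snippets = split_into_snippets(traj, snippet_len=4)
print(snippets.shape)
```

Both snippets share one template and differ only in their transformation parameters, which is the sense in which a template-plus-transformation encoding separates what the motion is from how it is posed or viewed.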

Results

Task	Dataset	Metric	Value	Model
Activity Recognition	NTU RGB+D 120	xset (%)	54.7	MCAE
Activity Recognition	NTU RGB+D 120	xsub (%)	52.8	MCAE
Action Recognition	NTU RGB+D 120	xset (%)	54.7	MCAE
Action Recognition	NTU RGB+D 120	xsub (%)	52.8	MCAE
