Action Recognition Based on Joint Trajectory Maps with Convolutional Neural Networks

Pichao Wang, Wanqing Li, Chuankun Li, Yonghong Hou

2016-12-30
Skeleton Based Action Recognition, Action Recognition, Temporal Action Localization

Abstract

Convolutional Neural Networks (ConvNets) have recently shown promising performance in many computer vision tasks, especially image-based recognition. How to effectively apply ConvNets to sequence-based data is still an open problem. This paper proposes a simple yet effective method that represents the spatio-temporal information carried in $3D$ skeleton sequences as three $2D$ images by encoding the joint trajectories and their dynamics into the color distribution of the images, referred to as Joint Trajectory Maps (JTMs), and adopts ConvNets to learn discriminative features for human action recognition. Such an image-based representation makes it possible to fine-tune existing ConvNet models for the classification of skeleton sequences without training the networks from scratch. The three JTMs are generated in three orthogonal planes and provide complementary information to each other. The final recognition is further improved through multiplicative score fusion of the three JTMs. The proposed method was evaluated on four public benchmark datasets: the large NTU RGB+D Dataset, the MSRC-12 Kinect Gesture Dataset (MSRC-12), the G3D Dataset and the UTD Multimodal Human Action Dataset (UTD-MHAD), and achieved state-of-the-art results.
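The two ideas in the abstract, projecting a skeleton sequence onto three orthogonal planes as colored images and fusing the per-plane class scores by multiplication, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the image size, the use of hue to encode frame order, and the drawing of joints as single pixels rather than connected trajectory segments are all assumptions made for illustration; the abstract only states that trajectories and dynamics are encoded into the color distribution.

```python
import colorsys

import numpy as np


def joint_trajectory_maps(skeleton, size=64):
    """Project a skeleton sequence onto three orthogonal planes.

    skeleton: array of shape (T, J, 3) with T frames and J joints.
    Returns three (size, size, 3) RGB images with values in [0, 1],
    one each for the xy, xz and yz planes.
    Assumption: frame order is encoded as hue; joints are drawn as
    single pixels and later frames overwrite earlier ones.
    """
    skeleton = np.asarray(skeleton, dtype=float)
    num_frames = skeleton.shape[0]

    # Rescale every coordinate axis to integer pixel positions.
    lo = skeleton.min(axis=(0, 1))
    hi = skeleton.max(axis=(0, 1))
    span = np.where(hi > lo, hi - lo, 1.0)
    pix = np.rint((skeleton - lo) / span * (size - 1)).astype(int)

    maps = []
    for a, b in [(0, 1), (0, 2), (1, 2)]:  # xy, xz, yz planes
        img = np.zeros((size, size, 3))
        for t in range(num_frames):
            # Hypothetical color scheme: hue sweeps with time.
            hue = 0.0 if num_frames == 1 else 0.8 * t / (num_frames - 1)
            img[pix[t, :, b], pix[t, :, a]] = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        maps.append(img)
    return maps


def multiply_score_fusion(score_vectors):
    """Fuse per-plane class scores by element-wise multiplication.

    score_vectors: sequence of 1-D score arrays, one per plane.
    Returns the predicted class index and the fused score vector.
    """
    fused = np.prod(np.asarray(score_vectors, dtype=float), axis=0)
    return int(np.argmax(fused)), fused
```

In use, each of the three images would be passed to its own fine-tuned ConvNet, and the three resulting class-score vectors would be handed to `multiply_score_fusion` to obtain the final label.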

Results

Task	Dataset	Metric	Value	Model
Video	Gaming 3D (G3D)	Accuracy	96	CNN
Temporal Action Localization	Gaming 3D (G3D)	Accuracy	96	CNN
Zero-Shot Learning	Gaming 3D (G3D)	Accuracy	96	CNN
Activity Recognition	Gaming 3D (G3D)	Accuracy	96	CNN
Action Localization	Gaming 3D (G3D)	Accuracy	96	CNN
Action Detection	Gaming 3D (G3D)	Accuracy	96	CNN
3D Action Recognition	Gaming 3D (G3D)	Accuracy	96	CNN
Action Recognition	Gaming 3D (G3D)	Accuracy	96	CNN

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)