
Skeleton Image Representation for 3D Action Recognition based on Tree Structure and Reference Joints

Carlos Caetano, François Brémond, William Robson Schwartz

2019-09-11 · Tasks: 3D Action Recognition, Skeleton Based Action Recognition, Action Recognition, Temporal Action Localization
Paper · PDF · Code (official)

Abstract

In recent years, the computer vision research community has studied how to model temporal dynamics in videos for 3D human action recognition. To that end, two main baseline approaches have been researched: (i) Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM); and (ii) skeleton image representations used as input to a Convolutional Neural Network (CNN). Although RNN approaches achieve excellent results, they lack the ability to efficiently learn the spatial relations between the skeleton joints. On the other hand, the representations used to feed CNN approaches have the advantage of naturally learning structural information from 2D arrays (i.e., they learn spatial relations from the skeleton joints). To further improve such representations, we introduce the Tree Structure Reference Joints Image (TSRJI), a novel skeleton image representation to be used as input to CNNs. The proposed representation combines reference joints with a tree-structured skeleton: the former incorporates different spatial relationships between the joints, while the latter preserves important spatial relations by traversing a skeleton tree with a depth-first order algorithm. Experimental results demonstrate the effectiveness of the proposed representation for 3D action recognition on two datasets, achieving state-of-the-art results on the recent NTU RGB+D 120 dataset.
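As an illustration of the core idea, the sketch below orders the joints of a toy skeleton by a depth-first traversal and expresses them relative to a reference joint before stacking the frames into a CNN-ready pseudo-image. The joint tree, the single reference joint, and the min-max scaling are simplifying assumptions made here for illustration; the paper's TSRJI uses the NTU RGB+D joint set, several reference joints, and its own normalization.

```python
# Minimal sketch of a TSRJI-style skeleton image, assuming a simplified
# joint tree (NOT the exact NTU RGB+D topology used in the paper) and a
# plain depth-first traversal to order the joints.
import numpy as np

# Hypothetical parent -> children adjacency for a toy skeleton rooted at the hip.
SKELETON_TREE = {
    0: [1, 4, 7],      # hip -> spine, left hip, right hip
    1: [2],            # spine -> neck
    2: [3],            # neck -> head
    4: [5], 5: [6],    # left leg chain
    7: [8], 8: [9],    # right leg chain
}

def depth_first_order(tree, root=0):
    """Return joint indices in depth-first order, preserving chain structure."""
    order, stack = [], [root]
    while stack:
        joint = stack.pop()
        order.append(joint)
        # Reverse so the first child listed is visited first.
        stack.extend(reversed(tree.get(joint, [])))
    return order

def skeleton_image(joints_xyz, tree, reference_joint=0):
    """Build a (num_joints, num_frames, 3) array: rows follow the depth-first
    joint order, coordinates are expressed relative to a reference joint."""
    # joints_xyz: (num_frames, num_joints, 3) array of 3D joint positions.
    order = depth_first_order(tree)
    relative = joints_xyz - joints_xyz[:, reference_joint:reference_joint + 1, :]
    image = relative[:, order, :].transpose(1, 0, 2)   # joints x frames x 3
    # Rescale to [0, 1] so the array can be fed to a CNN like an RGB image.
    lo, hi = image.min(), image.max()
    return (image - lo) / (hi - lo + 1e-8)

# Example: 32 frames of a 10-joint toy skeleton.
frames = np.random.rand(32, 10, 3)
print(skeleton_image(frames, SKELETON_TREE).shape)   # (10, 32, 3)
```

Feeding such an array to a 2D CNN lets the convolutions pick up spatial relations between joints that are adjacent in the traversal order, which is the property the tree-structured ordering is meant to preserve.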

Results

Task | Dataset | Metric | Value | Model
Video | NTU RGB+D | Accuracy (CS) | 73.3 | TSRJI (Late Fusion) + HCN
Video | NTU RGB+D | Accuracy (CV) | 80.3 | TSRJI (Late Fusion) + HCN
Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 73.3 | TSRJI (Late Fusion) + HCN
Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 80.3 | TSRJI (Late Fusion) + HCN
Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 73.3 | TSRJI (Late Fusion) + HCN
Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 80.3 | TSRJI (Late Fusion) + HCN
Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 67.9 | TSRJI
Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 62.8 | TSRJI
Activity Recognition | NTU RGB+D | Accuracy (CS) | 73.3 | TSRJI (Late Fusion) + HCN
Activity Recognition | NTU RGB+D | Accuracy (CV) | 80.3 | TSRJI (Late Fusion) + HCN
Action Localization | NTU RGB+D | Accuracy (CS) | 73.3 | TSRJI (Late Fusion) + HCN
Action Localization | NTU RGB+D | Accuracy (CV) | 80.3 | TSRJI (Late Fusion) + HCN
Action Detection | NTU RGB+D | Accuracy (CS) | 73.3 | TSRJI (Late Fusion) + HCN
Action Detection | NTU RGB+D | Accuracy (CV) | 80.3 | TSRJI (Late Fusion) + HCN
3D Action Recognition | NTU RGB+D | Accuracy (CS) | 73.3 | TSRJI (Late Fusion) + HCN
3D Action Recognition | NTU RGB+D | Accuracy (CV) | 80.3 | TSRJI (Late Fusion) + HCN
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 67.9 | TSRJI
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 62.8 | TSRJI
Action Recognition | NTU RGB+D | Accuracy (CS) | 73.3 | TSRJI (Late Fusion) + HCN
Action Recognition | NTU RGB+D | Accuracy (CV) | 80.3 | TSRJI (Late Fusion) + HCN
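
Several NTU RGB+D entries above report the model as TSRJI (Late Fusion) + HCN, i.e. the TSRJI network combined with HCN at the score level. The sketch below shows generic late score fusion under the assumption that each model outputs per-class probabilities for the same clip; the equal weighting and the function name late_fusion are illustrative, not the exact fusion scheme reported in the paper.

```python
# Minimal sketch of late score fusion, assuming each model already outputs
# per-class probabilities for the same clip; equal weighting is an
# illustrative choice, not the paper's exact scheme.
import numpy as np

def late_fusion(score_lists, weights=None):
    """Average (optionally weighted) class-score vectors from several models."""
    scores = np.stack(score_lists)                      # (num_models, num_classes)
    weights = np.ones(len(score_lists)) if weights is None else np.asarray(weights)
    weights = weights / weights.sum()
    return np.tensordot(weights, scores, axes=1)        # (num_classes,)

tsrji_scores = np.array([0.1, 0.7, 0.2])   # e.g., softmax output of the TSRJI CNN
hcn_scores   = np.array([0.2, 0.5, 0.3])   # e.g., softmax output of HCN
fused = late_fusion([tsrji_scores, hcn_scores])
print(fused.argmax())                       # predicted class after fusion
```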

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)