Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-based Action Recognition

Yi-Fan Song, Zhang Zhang, Caifeng Shan, Liang Wang

2020-10-20 · Skeleton Based Action Recognition · Action Recognition

Abstract

One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, the State-Of-The-Art (SOTA) models for this task tend to be exceedingly sophisticated and over-parameterized; their low efficiency in training and inference has obstructed progress in the field, especially on large-scale action datasets. In this work, we propose an efficient but strong baseline based on Graph Convolutional Networks (GCNs), which aggregates three main improvements: early fused Multiple Input Branches (MIB), a Residual GCN (ResGCN) with a bottleneck structure, and a Part-wise Attention (PartAtt) block. First, the MIB is designed to enrich informative skeleton features while keeping the representations compact at an early fusion stage. Then, inspired by the success of the ResNet architecture in Convolutional Neural Networks (CNNs), the ResGCN module is introduced to reduce computational cost and ease training while maintaining model accuracy. Finally, the PartAtt block is proposed to discover the most essential body parts over a whole action sequence and to obtain more explainable representations for different skeleton action sequences. Extensive experiments on two large-scale datasets, NTU RGB+D 60 and 120, validate that the proposed baseline slightly outperforms other SOTA models while requiring far fewer parameters during training and inference, e.g., up to 34 times fewer than DGNN, one of the best SOTA methods.
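Of the three components, the Part-wise Attention block is the most distinctive: it pools features within predefined body parts over the whole sequence and reweights each part by its importance. Below is a minimal PyTorch sketch of that idea, assuming features shaped (N, C, T, V) for batch, channels, frames, and joints; the five-part grouping of the 25 NTU RGB+D joints, the layer sizes, and the softmax normalization are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a Part-wise Attention (PartAtt) block.
# Assumed input layout: (N, C, T, V) = batch, channels, frames, joints.
import torch
import torch.nn as nn

class PartAtt(nn.Module):
    def __init__(self, channels, parts, reduction=4):
        super().__init__()
        self.parts = parts  # list of joint-index lists, one per body part
        inner = channels // reduction
        # Shared bottleneck MLP mapping pooled part features to attention logits.
        self.fcn = nn.Sequential(
            nn.Conv1d(channels, inner, kernel_size=1),
            nn.BatchNorm1d(inner),
            nn.ReLU(inplace=True),
            nn.Conv1d(inner, channels, kernel_size=1),
        )
        self.softmax = nn.Softmax(dim=-1)  # normalize weights across parts

    def forward(self, x):
        # x: (N, C, T, V)
        N, C, T, V = x.shape
        # Average-pool each body part over the whole sequence (time and joints).
        pooled = torch.stack(
            [x[:, :, :, p].mean(dim=(2, 3)) for p in self.parts], dim=-1
        )  # (N, C, P)
        att = self.softmax(self.fcn(pooled))  # (N, C, P): per-channel part weights
        # Scatter each part's weight back to every joint belonging to that part.
        joint_att = x.new_zeros(N, C, V)
        for i, p in enumerate(self.parts):
            joint_att[:, :, p] = att[:, :, i:i + 1]
        return x * joint_att.unsqueeze(2)  # broadcast over the T axis

# Example with an assumed five-part split of the 25 NTU RGB+D joints
# (torso/head, left arm, right arm, left leg, right leg).
parts = [[0, 1, 2, 3, 20], [4, 5, 6, 7, 21, 22], [8, 9, 10, 11, 23, 24],
         [12, 13, 14, 15], [16, 17, 18, 19]]
block = PartAtt(channels=64, parts=parts)
out = block(torch.randn(8, 64, 300, 25))  # -> (8, 64, 300, 25)
```

Pooling over the entire sequence before computing the weights is what makes the attention "part-wise" rather than frame-wise: each body part gets one weight per channel for the whole action, which is also what makes the learned weights easy to inspect and explain.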

Results

Task | Dataset | Metric | Value | Model
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 88.3 | PA-ResGCN-B19
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 87.3 | PA-ResGCN-B19
Action Recognition | NTU RGB+D | Accuracy (Cross-Subject) | 90.9 | PA-ResGCN-B19
Action Recognition | NTU RGB+D | Accuracy (Cross-View) | 96.0 | PA-ResGCN-B19

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
Active Multimodal Distillation for Few-shot Action Recognition (2025-06-16)