Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

Sijie Yan, Yuanjun Xiong, Dahua Lin

2018-01-23
3D Human Pose Estimation, Skeleton Based Action Recognition, Multimodal Activity Recognition, Action Recognition, Temporal Action Localization

Abstract

Dynamics of human body skeletons convey significant information for human action recognition. Conventional approaches for modeling skeletons usually rely on hand-crafted parts or traversal rules, thus resulting in limited expressive power and difficulties of generalization. In this work, we propose a novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data. This formulation not only leads to greater expressive power but also stronger generalization capability. On two large datasets, Kinetics and NTU-RGBD, it achieves substantial improvements over mainstream methods.
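
The abstract does not spell out the formulation, so the following is only a minimal sketch of the general idea: features are first aggregated across connected joints within each frame (spatial step), then filtered along the time axis for each joint (temporal step). It assumes the symmetric normalized adjacency commonly used in graph convolutional networks and uses fixed, non-learned weights. The function names, the toy 3-joint chain, and the averaging kernel are all hypothetical and are not taken from the authors' code.

```python
import math


def normalized_adjacency(num_joints, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected skeleton graph (assumed normalization)."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def spatial_graph_conv(x, adj):
    """Mix each joint's features with its neighbours' within every frame; x is [T][V][C]."""
    num_joints = len(adj)
    return [[[sum(adj[v][u] * frame[u][c] for u in range(num_joints))
              for c in range(len(frame[v]))]
             for v in range(num_joints)]
            for frame in x]


def temporal_conv(x, kernel):
    """Filter each joint's features along time with zero padding, keeping T unchanged."""
    t_len, pad = len(x), len(kernel) // 2
    out = []
    for t in range(t_len):
        frame = []
        for v in range(len(x[t])):
            feats = []
            for c in range(len(x[t][v])):
                acc = 0.0
                for k, w in enumerate(kernel):
                    s = t + k - pad  # source frame index for this kernel tap
                    if 0 <= s < t_len:
                        acc += w * x[s][v][c]
                feats.append(acc)
            frame.append(feats)
        out.append(frame)
    return out


# Toy input: a 3-joint chain (0-1-2), 4 frames, 1 channel per joint (hypothetical data).
EDGES = [(0, 1), (1, 2)]
ADJ = normalized_adjacency(3, EDGES)
X = [[[float(t + v)] for v in range(3)] for t in range(4)]
Y = temporal_conv(spatial_graph_conv(X, ADJ), [1 / 3, 1 / 3, 1 / 3])
```

In a trained model, both steps would carry learned weights and mix channels; the sketch only shows how the skeleton graph and the time axis are treated separately while the output keeps the input's frame and joint layout.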

Results

Task | Dataset | Metric | Value | Model
Video | NTU RGB+D 120 | Accuracy (Cross-Setup) | 88.4 | ST-GCN [PYSKL, 3D Skeleton]
Video | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.2 | ST-GCN [PYSKL, 3D Skeleton]
Video | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89 | ST-GCN [PYSKL, 2D Skeleton]
Video | NTU RGB+D 120 | Accuracy (Cross-Subject) | 84.7 | ST-GCN [PYSKL, 2D Skeleton]
Video | UAV-Human | CSv1(%) | 30.25 | ST-GCN
Video | UAV-Human | CSv2(%) | 56.14 | ST-GCN
Video | NTU RGB+D | Accuracy (CS) | 90.7 | ST-GCN [PYSKL, 3D Skeleton]
Video | NTU RGB+D | Accuracy (CV) | 96.5 | ST-GCN [PYSKL, 3D Skeleton]
Video | NTU RGB+D | Accuracy (CS) | 90.1 | ST-GCN [Vanilla, 2D Skeleton]
Video | NTU RGB+D | Accuracy (CV) | 95.1 | ST-GCN [Vanilla, 2D Skeleton]
Video | NTU RGB+D | Accuracy (CS) | 86.6 | ST-GCN [Vanilla, 3D Skeleton]
Video | NTU RGB+D | Accuracy (CV) | 93.2 | ST-GCN [Vanilla, 3D Skeleton]
Video | NTU RGB+D | Accuracy (CS) | 81.5 | ST-GCN
Video | NTU RGB+D | Accuracy (CV) | 88.3 | ST-GCN
Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 88.4 | ST-GCN [PYSKL, 3D Skeleton]
Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.2 | ST-GCN [PYSKL, 3D Skeleton]
Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89 | ST-GCN [PYSKL, 2D Skeleton]
Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 84.7 | ST-GCN [PYSKL, 2D Skeleton]
Temporal Action Localization | UAV-Human | CSv1(%) | 30.25 | ST-GCN
Temporal Action Localization | UAV-Human | CSv2(%) | 56.14 | ST-GCN
Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 90.7 | ST-GCN [PYSKL, 3D Skeleton]
Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 96.5 | ST-GCN [PYSKL, 3D Skeleton]
Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 90.1 | ST-GCN [Vanilla, 2D Skeleton]
Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 95.1 | ST-GCN [Vanilla, 2D Skeleton]
Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 86.6 | ST-GCN [Vanilla, 3D Skeleton]
Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 93.2 | ST-GCN [Vanilla, 3D Skeleton]
Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 81.5 | ST-GCN
Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 88.3 | ST-GCN
Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Setup) | 88.4 | ST-GCN [PYSKL, 3D Skeleton]
Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.2 | ST-GCN [PYSKL, 3D Skeleton]
Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89 | ST-GCN [PYSKL, 2D Skeleton]
Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Subject) | 84.7 | ST-GCN [PYSKL, 2D Skeleton]
Zero-Shot Learning | UAV-Human | CSv1(%) | 30.25 | ST-GCN
Zero-Shot Learning | UAV-Human | CSv2(%) | 56.14 | ST-GCN
Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 90.7 | ST-GCN [PYSKL, 3D Skeleton]
Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 96.5 | ST-GCN [PYSKL, 3D Skeleton]
Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 90.1 | ST-GCN [Vanilla, 2D Skeleton]
Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 95.1 | ST-GCN [Vanilla, 2D Skeleton]
Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 86.6 | ST-GCN [Vanilla, 3D Skeleton]
Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 93.2 | ST-GCN [Vanilla, 3D Skeleton]
Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 81.5 | ST-GCN
Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 88.3 | ST-GCN
Activity Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 73.86 | ST-GCN
Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 88.4 | ST-GCN [PYSKL, 3D Skeleton]
Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.2 | ST-GCN [PYSKL, 3D Skeleton]
Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89 | ST-GCN [PYSKL, 2D Skeleton]
Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 84.7 | ST-GCN [PYSKL, 2D Skeleton]
Activity Recognition | UAV-Human | CSv1(%) | 30.25 | ST-GCN
Activity Recognition | UAV-Human | CSv2(%) | 56.14 | ST-GCN
Activity Recognition | NTU RGB+D | Accuracy (CS) | 90.7 | ST-GCN [PYSKL, 3D Skeleton]
Activity Recognition | NTU RGB+D | Accuracy (CV) | 96.5 | ST-GCN [PYSKL, 3D Skeleton]
Activity Recognition | NTU RGB+D | Accuracy (CS) | 90.1 | ST-GCN [Vanilla, 2D Skeleton]
Activity Recognition | NTU RGB+D | Accuracy (CV) | 95.1 | ST-GCN [Vanilla, 2D Skeleton]
Activity Recognition | NTU RGB+D | Accuracy (CS) | 86.6 | ST-GCN [Vanilla, 3D Skeleton]
Activity Recognition | NTU RGB+D | Accuracy (CV) | 93.2 | ST-GCN [Vanilla, 3D Skeleton]
Activity Recognition | NTU RGB+D | Accuracy (CS) | 81.5 | ST-GCN
Activity Recognition | NTU RGB+D | Accuracy (CV) | 88.3 | ST-GCN
Activity Recognition | EV-Action | Accuracy | 79.6 | ST-GCN (Skeleton Kinect)
Activity Recognition | EV-Action | Accuracy | 50.7 | ST-GCN (Skeleton Vicon)
Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 88.4 | ST-GCN [PYSKL, 3D Skeleton]
Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.2 | ST-GCN [PYSKL, 3D Skeleton]
Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89 | ST-GCN [PYSKL, 2D Skeleton]
Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 84.7 | ST-GCN [PYSKL, 2D Skeleton]
Action Localization | UAV-Human | CSv1(%) | 30.25 | ST-GCN
Action Localization | UAV-Human | CSv2(%) | 56.14 | ST-GCN
Action Localization | NTU RGB+D | Accuracy (CS) | 90.7 | ST-GCN [PYSKL, 3D Skeleton]
Action Localization | NTU RGB+D | Accuracy (CV) | 96.5 | ST-GCN [PYSKL, 3D Skeleton]
Action Localization | NTU RGB+D | Accuracy (CS) | 90.1 | ST-GCN [Vanilla, 2D Skeleton]
Action Localization | NTU RGB+D | Accuracy (CV) | 95.1 | ST-GCN [Vanilla, 2D Skeleton]
Action Localization | NTU RGB+D | Accuracy (CS) | 86.6 | ST-GCN [Vanilla, 3D Skeleton]
Action Localization | NTU RGB+D | Accuracy (CV) | 93.2 | ST-GCN [Vanilla, 3D Skeleton]
Action Localization | NTU RGB+D | Accuracy (CS) | 81.5 | ST-GCN
Action Localization | NTU RGB+D | Accuracy (CV) | 88.3 | ST-GCN
Action Detection | NTU RGB+D 120 | Accuracy (Cross-Setup) | 88.4 | ST-GCN [PYSKL, 3D Skeleton]
Action Detection | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.2 | ST-GCN [PYSKL, 3D Skeleton]
Action Detection | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89 | ST-GCN [PYSKL, 2D Skeleton]
Action Detection | NTU RGB+D 120 | Accuracy (Cross-Subject) | 84.7 | ST-GCN [PYSKL, 2D Skeleton]
Action Detection | UAV-Human | CSv1(%) | 30.25 | ST-GCN
Action Detection | UAV-Human | CSv2(%) | 56.14 | ST-GCN
Action Detection | NTU RGB+D | Accuracy (CS) | 90.7 | ST-GCN [PYSKL, 3D Skeleton]
Action Detection | NTU RGB+D | Accuracy (CV) | 96.5 | ST-GCN [PYSKL, 3D Skeleton]
Action Detection | NTU RGB+D | Accuracy (CS) | 90.1 | ST-GCN [Vanilla, 2D Skeleton]
Action Detection | NTU RGB+D | Accuracy (CV) | 95.1 | ST-GCN [Vanilla, 2D Skeleton]
Action Detection | NTU RGB+D | Accuracy (CS) | 86.6 | ST-GCN [Vanilla, 3D Skeleton]
Action Detection | NTU RGB+D | Accuracy (CV) | 93.2 | ST-GCN [Vanilla, 3D Skeleton]
Action Detection | NTU RGB+D | Accuracy (CS) | 81.5 | ST-GCN
Action Detection | NTU RGB+D | Accuracy (CV) | 88.3 | ST-GCN
3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 88.4 | ST-GCN [PYSKL, 3D Skeleton]
3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.2 | ST-GCN [PYSKL, 3D Skeleton]
3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89 | ST-GCN [PYSKL, 2D Skeleton]
3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 84.7 | ST-GCN [PYSKL, 2D Skeleton]
3D Action Recognition | UAV-Human | CSv1(%) | 30.25 | ST-GCN
3D Action Recognition | UAV-Human | CSv2(%) | 56.14 | ST-GCN
3D Action Recognition | NTU RGB+D | Accuracy (CS) | 90.7 | ST-GCN [PYSKL, 3D Skeleton]
3D Action Recognition | NTU RGB+D | Accuracy (CV) | 96.5 | ST-GCN [PYSKL, 3D Skeleton]
3D Action Recognition | NTU RGB+D | Accuracy (CS) | 90.1 | ST-GCN [Vanilla, 2D Skeleton]
3D Action Recognition | NTU RGB+D | Accuracy (CV) | 95.1 | ST-GCN [Vanilla, 2D Skeleton]
3D Action Recognition | NTU RGB+D | Accuracy (CS) | 86.6 | ST-GCN [Vanilla, 3D Skeleton]
3D Action Recognition | NTU RGB+D | Accuracy (CV) | 93.2 | ST-GCN [Vanilla, 3D Skeleton]
3D Action Recognition | NTU RGB+D | Accuracy (CS) | 81.5 | ST-GCN
3D Action Recognition | NTU RGB+D | Accuracy (CV) | 88.3 | ST-GCN
Action Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 73.86 | ST-GCN
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 88.4 | ST-GCN [PYSKL, 3D Skeleton]
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.2 | ST-GCN [PYSKL, 3D Skeleton]
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89 | ST-GCN [PYSKL, 2D Skeleton]
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 84.7 | ST-GCN [PYSKL, 2D Skeleton]
Action Recognition | UAV-Human | CSv1(%) | 30.25 | ST-GCN
Action Recognition | UAV-Human | CSv2(%) | 56.14 | ST-GCN
Action Recognition | NTU RGB+D | Accuracy (CS) | 90.7 | ST-GCN [PYSKL, 3D Skeleton]
Action Recognition | NTU RGB+D | Accuracy (CV) | 96.5 | ST-GCN [PYSKL, 3D Skeleton]
Action Recognition | NTU RGB+D | Accuracy (CS) | 90.1 | ST-GCN [Vanilla, 2D Skeleton]
Action Recognition | NTU RGB+D | Accuracy (CV) | 95.1 | ST-GCN [Vanilla, 2D Skeleton]
Action Recognition | NTU RGB+D | Accuracy (CS) | 86.6 | ST-GCN [Vanilla, 3D Skeleton]
Action Recognition | NTU RGB+D | Accuracy (CV) | 93.2 | ST-GCN [Vanilla, 3D Skeleton]
Action Recognition | NTU RGB+D | Accuracy (CS) | 81.5 | ST-GCN
Action Recognition | NTU RGB+D | Accuracy (CV) | 88.3 | ST-GCN

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Systematic Comparison of Projection Methods for Monocular 3D Human Pose Estimation on Fisheye Images (2025-06-24)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)