Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition

Tailin Chen, Desen Zhou, Jian Wang, Shidong Wang, Yu Guan, Xuming He, Errui Ding

2021-08-10
Tasks: Action Classification, Skeleton Based Action Recognition, Scene Understanding, Action Recognition

Abstract

The task of skeleton-based action recognition remains a core challenge in human-centred scene understanding due to the multiple granularities and large variation in human motion. Existing approaches typically employ a single neural representation for different motion patterns, which makes it difficult to capture fine-grained action classes given limited training data. To address these problems, we propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification that jointly models the coarse- and fine-grained skeleton motion patterns. To this end, we develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions in an effective and efficient manner. Moreover, our network utilises a cross-head communication strategy to mutually enhance the representations of both heads. We conduct extensive experiments on three large-scale datasets, namely NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton, and achieve state-of-the-art performance on all three benchmarks, which validates the effectiveness of our method.
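
The idea of two heads working at different temporal resolutions and exchanging information can be illustrated with a toy sketch. This is not the authors' DualHead-Net: the function names, the parameter-free neighbour averaging, the stride of 2, and the additive exchange with weight `alpha` are all assumptions chosen only to make the abstract's description concrete. Each joint carries a single scalar feature here, and nothing is learned.

```python
from typing import List, Tuple

# A skeleton sequence: T frames x J joints, one scalar feature per joint (assumed).
Sequence = List[List[float]]


def temporal_pool(seq: Sequence, stride: int = 2) -> Sequence:
    """Average every `stride` consecutive frames to get a coarser sequence."""
    pooled = []
    for start in range(0, len(seq) - stride + 1, stride):
        window = seq[start:start + stride]
        pooled.append([sum(col) / stride for col in zip(*window)])
    return pooled


def temporal_upsample(seq: Sequence, stride: int = 2) -> Sequence:
    """Repeat each coarse frame `stride` times (nearest-neighbour upsampling)."""
    return [list(frame) for frame in seq for _ in range(stride)]


def graph_aggregate(seq: Sequence, neighbours: List[List[int]]) -> Sequence:
    """Replace each joint feature by the mean over itself and its neighbours."""
    return [
        [
            (frame[j] + sum(frame[k] for k in neighbours[j])) / (1 + len(neighbours[j]))
            for j in range(len(frame))
        ]
        for frame in seq
    ]


def cross_head_exchange(
    fine: Sequence, coarse: Sequence, stride: int = 2, alpha: float = 0.5
) -> Tuple[Sequence, Sequence]:
    """Hypothetical exchange: each head adds a scaled copy of the other head."""
    coarse_up = temporal_upsample(coarse, stride)  # coarse -> fine resolution
    fine_down = temporal_pool(fine, stride)        # fine -> coarse resolution
    new_fine = [[f + alpha * c for f, c in zip(ff, cf)]
                for ff, cf in zip(fine, coarse_up)]
    new_coarse = [[c + alpha * f for c, f in zip(cf, ff)]
                  for cf, ff in zip(coarse, fine_down)]
    return new_fine, new_coarse


def dual_head_features(
    seq: Sequence, neighbours: List[List[int]], stride: int = 2
) -> Tuple[Sequence, Sequence]:
    """Run a fine head (all frames) and a coarse head (pooled frames), then exchange."""
    fine = graph_aggregate(seq, neighbours)
    coarse = graph_aggregate(temporal_pool(seq, stride), neighbours)
    return cross_head_exchange(fine, coarse, stride)


# Usage: 4 frames of a 3-joint chain (joint 1 is connected to joints 0 and 2).
neighbours = [[1], [0, 2], [1]]
seq = [[0.0] * 3, [2.0] * 3, [4.0] * 3, [6.0] * 3]
fine, coarse = dual_head_features(seq, neighbours)
```

In this sketch the fine head keeps all 4 frames and the coarse head keeps 2. If the number of frames is not a multiple of `stride`, the trailing frames are dropped by the pooling and exchange steps.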

Results

Dataset                     Metric                      Value   Model
NTU RGB+D 120               Accuracy (Cross-Setup)      89.3    DualHead-Net
NTU RGB+D 120               Accuracy (Cross-Subject)    88.2    DualHead-Net
NTU RGB+D 120               Ensembled Modalities        4       DualHead-Net
Kinetics-Skeleton dataset   Accuracy                    38.4    DualHead-Net
NTU RGB+D                   Accuracy (CS)               92      DualHead-Net
NTU RGB+D                   Accuracy (CV)               96.6    DualHead-Net

The same six results are listed under each of the following tasks: Video, Temporal Action Localization, Zero-Shot Learning, Activity Recognition, Action Localization, Action Detection, 3D Action Recognition, Action Recognition.
