Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition

Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, Wanli Ouyang

Published: 2020-03-31 · CVPR 2020
Tasks: 3D Action Recognition · Skeleton Based Action Recognition · Long-range modeling · Action Recognition
Links: Paper · PDF · Code (official) · Code · Code

Abstract

Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.
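The two proposals in the abstract can be illustrated concretely. A minimal NumPy sketch (an assumption-laden illustration, not the authors' released code): `disentangled_k_adjacency` builds one adjacency matrix per scale that keeps only joints at *exactly* k hops, so distant neighborhoods are not dominated by cumulative graph powers; `g3d_window_adjacency` tiles the spatial graph across every pair of frames in a τ-frame window, giving the dense cross-spacetime edges that G3D uses as skip connections. Function names and the plain-NumPy formulation are mine.

```python
import numpy as np

def disentangled_k_adjacency(A, num_scales):
    """Disentangled multi-scale aggregation (sketch).

    Returns one matrix per scale k where entry (i, j) is 1 iff the
    shortest-path distance between joints i and j is exactly k, so each
    scale aggregates a disjoint neighborhood ring rather than the
    biased cumulative powers A^k.
    """
    n = A.shape[0]
    hop = ((A + np.eye(n)) > 0).astype(int)  # 1-hop reachability incl. self
    reach_prev = np.eye(n, dtype=int)        # reachable within 0 hops
    reach = np.eye(n, dtype=int)
    mats = []
    for _ in range(num_scales):
        reach = ((reach @ hop) > 0).astype(int)        # within k hops
        mats.append((reach - reach_prev).astype(float))  # exactly k hops
        reach_prev = reach
    return mats

def g3d_window_adjacency(A, tau):
    """Unified spatial-temporal adjacency for a tau-frame window (sketch).

    Tiles the (self-loop-augmented) spatial graph across all tau x tau
    frame pairs, so every joint connects to itself and its spatial
    neighbors in every frame of the window: dense cross-spacetime edges.
    """
    n = A.shape[0]
    A_hat = ((A + np.eye(n)) > 0).astype(float)
    return np.tile(A_hat, (tau, tau))
```

On a 4-joint chain graph, scale 1 recovers the direct bones, scale 2 connects joints two hops apart, and so on; the windowed matrix for τ = 3 is a 12×12 block-tiled graph over which a single graph convolution mixes information across space and time in one step.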

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Video | Assembly101 | Actions Top-1 | 28.7 | MS-G3D
Video | Assembly101 | Object Top-1 | 36.3 | MS-G3D
Video | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D
Video | Kinetics-Skeleton | Accuracy | 38 | MS-G3D
Video | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net
Video | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net
Temporal Action Localization | Assembly101 | Actions Top-1 | 28.7 | MS-G3D
Temporal Action Localization | Assembly101 | Object Top-1 | 36.3 | MS-G3D
Temporal Action Localization | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D
Temporal Action Localization | Kinetics-Skeleton | Accuracy | 38 | MS-G3D
Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net
Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net
Zero-Shot Learning | Assembly101 | Actions Top-1 | 28.7 | MS-G3D
Zero-Shot Learning | Assembly101 | Object Top-1 | 36.3 | MS-G3D
Zero-Shot Learning | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D
Zero-Shot Learning | Kinetics-Skeleton | Accuracy | 38 | MS-G3D
Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net
Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net
Activity Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 50.83 | MS-G3D
Activity Recognition | Assembly101 | Actions Top-1 | 28.7 | MS-G3D
Activity Recognition | Assembly101 | Object Top-1 | 36.3 | MS-G3D
Activity Recognition | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D
Activity Recognition | Kinetics-Skeleton | Accuracy | 38 | MS-G3D
Activity Recognition | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net
Activity Recognition | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net
Action Localization | Assembly101 | Actions Top-1 | 28.7 | MS-G3D
Action Localization | Assembly101 | Object Top-1 | 36.3 | MS-G3D
Action Localization | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D
Action Localization | Kinetics-Skeleton | Accuracy | 38 | MS-G3D
Action Localization | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net
Action Localization | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net
Action Detection | Kinetics-Skeleton | Accuracy | 38 | MS-G3D
Action Detection | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net
Action Detection | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net
3D Action Recognition | Assembly101 | Actions Top-1 | 28.7 | MS-G3D
3D Action Recognition | Assembly101 | Object Top-1 | 36.3 | MS-G3D
3D Action Recognition | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D
3D Action Recognition | Kinetics-Skeleton | Accuracy | 38 | MS-G3D
3D Action Recognition | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net
3D Action Recognition | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net
Action Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 50.83 | MS-G3D
Action Recognition | Assembly101 | Actions Top-1 | 28.7 | MS-G3D
Action Recognition | Assembly101 | Object Top-1 | 36.3 | MS-G3D
Action Recognition | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D
Action Recognition | Kinetics-Skeleton | Accuracy | 38 | MS-G3D
Action Recognition | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net
Action Recognition | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV (2025-07-15)
LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models (2025-07-14)
MambaFusion: Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection (2025-07-06)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)