Hu Cui, Renjing Huang, Ruoyu Zhang, Tessai Hayama
Graph convolutional networks (GCNs) have emerged as a powerful tool for skeleton-based action and gesture recognition, thanks to their ability to model spatial and temporal dependencies in skeleton data. However, existing GCN-based methods face critical limitations: (1) they lack effective spatio-temporal topology modeling that captures dynamic variations in skeletal motion, and (2) they struggle to model multiscale structural relationships beyond local joint connectivity. To address these issues, we propose a novel framework called Dynamic Spatial-Temporal Semantic Awareness Graph Convolutional Network (DSTSA-GCN). DSTSA-GCN introduces three key modules: Group Channel-wise Graph Convolution (GC-GC), Group Temporal-wise Graph Convolution (GT-GC), and Multi-Scale Temporal Convolution (MS-TCN). GC-GC and GT-GC operate in parallel to independently model channel-specific and frame-specific correlations, enabling robust topology learning that accounts for temporal variations. Additionally, both modules employ a grouping strategy to adaptively capture multiscale structural relationships. Complementing this, MS-TCN enhances temporal modeling through group-wise temporal convolutions with diverse receptive fields. Extensive experiments demonstrate that DSTSA-GCN significantly improves the topology modeling capabilities of GCNs, achieving state-of-the-art performance on benchmark datasets for gesture and action recognition, including SHREC17 Track, DHG-14/28, NTU-RGB+D, and NTU-RGB+D-120.
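The phrase "group-wise temporal convolutions with diverse receptive fields" can be illustrated with a small sketch. The code below is not the authors' MS-TCN: it assumes that channels are split into equal groups, that each group gets its own dilation, and that a single fixed kernel is shared by all channels (a real layer would learn its weights). The names `dilated_temporal_conv`, `multi_scale_temporal_conv`, and `receptive_field` are hypothetical and exist only for this illustration.

```python
from typing import List

Sequence = List[List[float]]  # indexed as [frame][channel]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of frames spanned by one dilated temporal kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_temporal_conv(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Zero-padded 'same' 1D convolution (cross-correlation form) along time."""
    half = (len(kernel) // 2) * dilation
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            src = t - half + i * dilation
            if 0 <= src < len(x):  # frames outside the clip count as zero
                acc += w * x[src]
        out.append(acc)
    return out


def multi_scale_temporal_conv(seq: Sequence, dilations: List[int], kernel: List[float]) -> Sequence:
    """Split channels into len(dilations) equal groups; group g uses dilations[g].

    Assumption for illustration: the kernel is fixed and shared, not learned.
    """
    num_frames, num_channels = len(seq), len(seq[0])
    if num_channels % len(dilations) != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = num_channels // len(dilations)
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for c in range(num_channels):
        dilation = dilations[c // per_group]
        channel = [frame[c] for frame in seq]
        for t, value in enumerate(dilated_temporal_conv(channel, kernel, dilation)):
            out[t][c] = value
    return out


if __name__ == "__main__":
    # An impulse at frame 4 shows which frames each channel group can "see".
    clip = [[1.0 if t == 4 else 0.0] * 2 for t in range(9)]
    result = multi_scale_temporal_conv(clip, dilations=[1, 2], kernel=[1.0, 1.0, 1.0])
    for g, d in enumerate([1, 2]):
        frames = [t for t in range(9) if result[t][g] != 0.0]
        print(f"group {g}: dilation {d}, receptive field {receptive_field(3, d)}, frames {frames}")
```

With a kernel of size 3, the group with dilation 1 covers three consecutive frames while the group with dilation 2 covers a five-frame span, so the two groups respond to short and longer motion patterns respectively within the same layer.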
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.97 | DSTSA-GCN |
| Video | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.12 | DSTSA-GCN |
| Video | NTU RGB+D 120 | Ensembled Modalities | 4 | DSTSA-GCN |
| Video | SHREC 2017 track on 3D Hand Gesture Recognition | 14 gestures accuracy | 97.74 | DSTSA-GCN |
| Video | SHREC 2017 track on 3D Hand Gesture Recognition | 28 gestures accuracy | 95.37 | DSTSA-GCN |
| Video | N-UCLA | Accuracy | 96.98 | DSTSA-GCN |
| Video | NTU RGB+D | Accuracy (CS) | 92.78 | DSTSA-GCN |
| Video | NTU RGB+D | Accuracy (CV) | 97.03 | DSTSA-GCN |
| Video | NTU RGB+D | Ensembled Modalities | 4 | DSTSA-GCN |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.97 | DSTSA-GCN |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.12 | DSTSA-GCN |
| Temporal Action Localization | NTU RGB+D 120 | Ensembled Modalities | 4 | DSTSA-GCN |
| Temporal Action Localization | SHREC 2017 track on 3D Hand Gesture Recognition | 14 gestures accuracy | 97.74 | DSTSA-GCN |
| Temporal Action Localization | SHREC 2017 track on 3D Hand Gesture Recognition | 28 gestures accuracy | 95.37 | DSTSA-GCN |
| Temporal Action Localization | N-UCLA | Accuracy | 96.98 | DSTSA-GCN |
| Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 92.78 | DSTSA-GCN |
| Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 97.03 | DSTSA-GCN |
| Temporal Action Localization | NTU RGB+D | Ensembled Modalities | 4 | DSTSA-GCN |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.97 | DSTSA-GCN |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.12 | DSTSA-GCN |
| Zero-Shot Learning | NTU RGB+D 120 | Ensembled Modalities | 4 | DSTSA-GCN |
| Zero-Shot Learning | SHREC 2017 track on 3D Hand Gesture Recognition | 14 gestures accuracy | 97.74 | DSTSA-GCN |
| Zero-Shot Learning | SHREC 2017 track on 3D Hand Gesture Recognition | 28 gestures accuracy | 95.37 | DSTSA-GCN |
| Zero-Shot Learning | N-UCLA | Accuracy | 96.98 | DSTSA-GCN |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 92.78 | DSTSA-GCN |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 97.03 | DSTSA-GCN |
| Zero-Shot Learning | NTU RGB+D | Ensembled Modalities | 4 | DSTSA-GCN |
| Activity Recognition | NTU RGB+D | Accuracy (CS) | 92.78 | DSTSA-GCN |
| Activity Recognition | NTU RGB+D | Accuracy (CV) | 97.03 | DSTSA-GCN |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.97 | DSTSA-GCN |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.12 | DSTSA-GCN |
| Activity Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | DSTSA-GCN |
| Activity Recognition | SHREC 2017 track on 3D Hand Gesture Recognition | 14 gestures accuracy | 97.74 | DSTSA-GCN |
| Activity Recognition | SHREC 2017 track on 3D Hand Gesture Recognition | 28 gestures accuracy | 95.37 | DSTSA-GCN |
| Activity Recognition | N-UCLA | Accuracy | 96.98 | DSTSA-GCN |
| Activity Recognition | NTU RGB+D | Ensembled Modalities | 4 | DSTSA-GCN |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.97 | DSTSA-GCN |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.12 | DSTSA-GCN |
| Action Localization | NTU RGB+D 120 | Ensembled Modalities | 4 | DSTSA-GCN |
| Action Localization | SHREC 2017 track on 3D Hand Gesture Recognition | 14 gestures accuracy | 97.74 | DSTSA-GCN |
| Action Localization | SHREC 2017 track on 3D Hand Gesture Recognition | 28 gestures accuracy | 95.37 | DSTSA-GCN |
| Action Localization | N-UCLA | Accuracy | 96.98 | DSTSA-GCN |
| Action Localization | NTU RGB+D | Accuracy (CS) | 92.78 | DSTSA-GCN |
| Action Localization | NTU RGB+D | Accuracy (CV) | 97.03 | DSTSA-GCN |
| Action Localization | NTU RGB+D | Ensembled Modalities | 4 | DSTSA-GCN |
| Hand | DHG-28 | Accuracy | 93.57 | DSTSA-GCN |
| Hand | DHG-14 | Accuracy | 95.04 | DSTSA-GCN |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.97 | DSTSA-GCN |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.12 | DSTSA-GCN |
| Action Detection | NTU RGB+D 120 | Ensembled Modalities | 4 | DSTSA-GCN |
| Action Detection | SHREC 2017 track on 3D Hand Gesture Recognition | 14 gestures accuracy | 97.74 | DSTSA-GCN |
| Action Detection | SHREC 2017 track on 3D Hand Gesture Recognition | 28 gestures accuracy | 95.37 | DSTSA-GCN |
| Action Detection | N-UCLA | Accuracy | 96.98 | DSTSA-GCN |
| Action Detection | NTU RGB+D | Accuracy (CS) | 92.78 | DSTSA-GCN |
| Action Detection | NTU RGB+D | Accuracy (CV) | 97.03 | DSTSA-GCN |
| Action Detection | NTU RGB+D | Ensembled Modalities | 4 | DSTSA-GCN |
| Gesture Recognition | DHG-28 | Accuracy | 93.57 | DSTSA-GCN |
| Gesture Recognition | DHG-14 | Accuracy | 95.04 | DSTSA-GCN |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.97 | DSTSA-GCN |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.12 | DSTSA-GCN |
| 3D Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | DSTSA-GCN |
| 3D Action Recognition | SHREC 2017 track on 3D Hand Gesture Recognition | 14 gestures accuracy | 97.74 | DSTSA-GCN |
| 3D Action Recognition | SHREC 2017 track on 3D Hand Gesture Recognition | 28 gestures accuracy | 95.37 | DSTSA-GCN |
| 3D Action Recognition | N-UCLA | Accuracy | 96.98 | DSTSA-GCN |
| 3D Action Recognition | NTU RGB+D | Accuracy (CS) | 92.78 | DSTSA-GCN |
| 3D Action Recognition | NTU RGB+D | Accuracy (CV) | 97.03 | DSTSA-GCN |
| 3D Action Recognition | NTU RGB+D | Ensembled Modalities | 4 | DSTSA-GCN |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 92.78 | DSTSA-GCN |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 97.03 | DSTSA-GCN |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.97 | DSTSA-GCN |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.12 | DSTSA-GCN |
| Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | DSTSA-GCN |
| Action Recognition | SHREC 2017 track on 3D Hand Gesture Recognition | 14 gestures accuracy | 97.74 | DSTSA-GCN |
| Action Recognition | SHREC 2017 track on 3D Hand Gesture Recognition | 28 gestures accuracy | 95.37 | DSTSA-GCN |
| Action Recognition | N-UCLA | Accuracy | 96.98 | DSTSA-GCN |
| Action Recognition | NTU RGB+D | Ensembled Modalities | 4 | DSTSA-GCN |