Haodong Duan, Jiaqi Wang, Kai Chen, Dahua Lin
Graph convolutional networks (GCNs) have been widely used in skeleton-based action recognition. We note that existing GCN-based approaches primarily rely on prescribed graphical structures (i.e., a manually defined topology of skeleton joints), which limits their flexibility to capture complicated correlations between joints. To move beyond this limitation, we propose a new framework for skeleton-based action recognition, namely Dynamic Group Spatio-Temporal GCN (DG-STGCN). It consists of two modules, DG-GCN and DG-TCN, for spatial and temporal modeling, respectively. In particular, DG-GCN uses learned affinity matrices to capture dynamic graphical structures instead of relying on a prescribed one, while DG-TCN performs group-wise temporal convolutions with varying receptive fields and incorporates a dynamic joint-skeleton fusion module for adaptive multi-level temporal modeling. On a wide range of benchmarks, including NTU RGB+D, Kinetics-Skeleton, BABEL, and Toyota SmartHome, DG-STGCN consistently outperforms state-of-the-art methods, often by a notable margin.
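To make the idea of replacing a prescribed skeleton topology with learned affinity matrices more concrete, here is a minimal sketch of a group-wise spatial aggregation step. It is not the authors' implementation: the function names (`row_softmax`, `matmul`, `group_graph_conv`), the choice of one affinity matrix per channel group, and the softmax normalization are all assumptions made for illustration, and the affinity values are fixed placeholders standing in for parameters that would be learned during training.

```python
import math
from typing import List

Matrix = List[List[float]]


def row_softmax(matrix: Matrix) -> Matrix:
    """Normalize each row so its weights are positive and sum to 1."""
    result = []
    for row in matrix:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def group_graph_conv(x: Matrix, affinities: List[Matrix]) -> Matrix:
    """Aggregate joint features with one affinity matrix per channel group.

    x          : V x C joint features (V joints, C channels).
    affinities : K matrices of shape V x V. In a trained model these would be
                 learnable parameters; no skeleton adjacency is used
                 (hypothetical setup, assumed for this sketch).
    Returns a V x C matrix in which channel group g has been mixed across
    joints by the row-normalized affinities[g].
    """
    num_joints, channels, groups = len(x), len(x[0]), len(affinities)
    if channels % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    width = channels // groups
    out: Matrix = [[] for _ in range(num_joints)]
    for g, affinity in enumerate(affinities):
        # Slice out the channels that belong to group g.
        chunk = [row[g * width:(g + 1) * width] for row in x]
        mixed = matmul(row_softmax(affinity), chunk)
        for j in range(num_joints):
            out[j].extend(mixed[j])
    return out


if __name__ == "__main__":
    # Three joints, two channels, two groups (one channel per group).
    features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    uniform = [[0.0] * 3 for _ in range(3)]  # every joint attends equally to all joints
    peaked = [[8.0 if i == j else 0.0 for j in range(3)] for i in range(3)]  # mostly self-attention
    for joint_row in group_graph_conv(features, [uniform, peaked]):
        print(joint_row)
```

Because each group has its own affinity matrix, different channel groups can aggregate over different joint neighborhoods, which a single fixed adjacency matrix cannot express. A full model would also add per-group feature projections and train the affinities end to end.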
The results below are listed identically under each of the following tasks: Video, Temporal Action Localization, Zero-Shot Learning, Activity Recognition, Action Localization, Action Detection, 3D Action Recognition, Action Recognition.

| Dataset | Metric | Value | Model |
|---|---|---|---|
| NTU RGB+D 120 | Accuracy (Cross-Setup) | 91.3 | DG-STGCN |
| NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.6 | DG-STGCN |
| NTU RGB+D 120 | Ensembled Modalities | 4 | DG-STGCN |
| NTU RGB+D | Accuracy (CS) | 93.2 | DG-STGCN |
| NTU RGB+D | Accuracy (CV) | 97.5 | DG-STGCN |
| NTU RGB+D | Ensembled Modalities | 4 | DG-STGCN |