Tailin Chen, Desen Zhou, Jian Wang, Shidong Wang, Yu Guan, Xuming He, Errui Ding
Skeleton-based action recognition remains a core challenge in human-centred scene understanding because human motion spans multiple granularities and varies widely. Existing approaches typically employ a single neural representation for all motion patterns, which makes it difficult to capture fine-grained action classes given limited training data. To address this problem, we propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification that jointly models coarse- and fine-grained skeleton motion patterns. To this end, we develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions effectively and efficiently. Moreover, our network utilises a cross-head communication strategy to mutually enhance the representations of both heads. We conduct extensive experiments on three large-scale datasets, namely NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton, and achieve state-of-the-art performance on all of these benchmarks, which validates the effectiveness of our method.
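To make the dual-head idea concrete, the following is a minimal NumPy sketch of two branches that process the same skeleton sequence at two temporal resolutions and exchange information before classification. It is an illustration under stated assumptions, not the DualHead-Net implementation: the single graph-convolution layer per head, the stride-2 average pooling, the sigmoid gating used for cross-head communication, the summed logits, and all names (`dual_head_forward`, `w_fine`, `cls_coarse`, and so on) are hypothetical choices made for this example.

```python
import numpy as np

def normalize_adjacency(adj):
    """Add self-loops and row-normalise a (V, V) adjacency matrix."""
    adj = adj + np.eye(adj.shape[0])
    return adj / adj.sum(axis=1, keepdims=True)

def graph_conv(x, adj, weight):
    """One spatial graph convolution with ReLU.
    x: (C_in, T, V), adj: (V, V) row-normalised, weight: (C_out, C_in)."""
    aggregated = np.einsum("ctv,wv->ctw", x, adj)  # joint w averages its neighbours v
    return np.maximum(np.einsum("oc,ctv->otv", weight, aggregated), 0.0)

def temporal_pool(x, stride):
    """Average-pool along time with the given stride (trailing frames dropped)."""
    c, t, v = x.shape
    t_trim = (t // stride) * stride
    return x[:, :t_trim].reshape(c, t_trim // stride, stride, v).mean(axis=2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_head_forward(x, adj, params, stride=2):
    """Assumed two-head forward pass: fine head at full temporal resolution,
    coarse head on temporally pooled input, with gating-based exchange."""
    fine = graph_conv(x, adj, params["w_fine"])                              # (C, T, V)
    coarse = graph_conv(temporal_pool(x, stride), adj, params["w_coarse"])   # (C, T//stride, V)

    # Cross-head communication (assumption: simple sigmoid gates in both directions).
    coarse_up = np.repeat(coarse, stride, axis=1)         # upsample coarse to fine length
    fine = fine[:, :coarse_up.shape[1]]                   # align lengths if T is not divisible
    fine_out = fine * sigmoid(coarse_up)                  # coarse head modulates fine head
    coarse_out = coarse * sigmoid(temporal_pool(fine, stride))  # fine head modulates coarse head

    # Global average pooling per head, then sum the two heads' class scores.
    logits_fine = params["cls_fine"] @ fine_out.mean(axis=(1, 2))
    logits_coarse = params["cls_coarse"] @ coarse_out.mean(axis=(1, 2))
    return logits_fine + logits_coarse

# Toy usage: a 5-joint chain skeleton, 8 frames, 3 input channels, 6 classes.
rng = np.random.default_rng(0)
num_joints, num_frames, c_in, c_hid, num_classes = 5, 8, 3, 4, 6
chain = np.zeros((num_joints, num_joints))
for j in range(num_joints - 1):
    chain[j, j + 1] = chain[j + 1, j] = 1.0
adj = normalize_adjacency(chain)
params = {
    "w_fine": rng.normal(size=(c_hid, c_in)),
    "w_coarse": rng.normal(size=(c_hid, c_in)),
    "cls_fine": rng.normal(size=(num_classes, c_hid)),
    "cls_coarse": rng.normal(size=(num_classes, c_hid)),
}
logits = dual_head_forward(rng.normal(size=(c_in, num_frames, num_joints)), adj, params)
```

In this toy setup the coarse head sees half as many time steps as the fine head, which is where the efficiency argument for a lower-resolution branch comes from; the gating step is only one of many possible ways to let the heads inform each other.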
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.3 | DualHead-Net |
| Video | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.2 | DualHead-Net |
| Video | NTU RGB+D 120 | Ensembled Modalities | 4 | DualHead-Net |
| Video | Kinetics-Skeleton dataset | Accuracy | 38.4 | DualHead-Net |
| Video | NTU RGB+D | Accuracy (CS) | 92 | DualHead-Net |
| Video | NTU RGB+D | Accuracy (CV) | 96.6 | DualHead-Net |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.3 | DualHead-Net |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.2 | DualHead-Net |
| Temporal Action Localization | NTU RGB+D 120 | Ensembled Modalities | 4 | DualHead-Net |
| Temporal Action Localization | Kinetics-Skeleton dataset | Accuracy | 38.4 | DualHead-Net |
| Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 92 | DualHead-Net |
| Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 96.6 | DualHead-Net |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.3 | DualHead-Net |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.2 | DualHead-Net |
| Zero-Shot Learning | NTU RGB+D 120 | Ensembled Modalities | 4 | DualHead-Net |
| Zero-Shot Learning | Kinetics-Skeleton dataset | Accuracy | 38.4 | DualHead-Net |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 92 | DualHead-Net |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 96.6 | DualHead-Net |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.3 | DualHead-Net |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.2 | DualHead-Net |
| Activity Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | DualHead-Net |
| Activity Recognition | Kinetics-Skeleton dataset | Accuracy | 38.4 | DualHead-Net |
| Activity Recognition | NTU RGB+D | Accuracy (CS) | 92 | DualHead-Net |
| Activity Recognition | NTU RGB+D | Accuracy (CV) | 96.6 | DualHead-Net |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.3 | DualHead-Net |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.2 | DualHead-Net |
| Action Localization | NTU RGB+D 120 | Ensembled Modalities | 4 | DualHead-Net |
| Action Localization | Kinetics-Skeleton dataset | Accuracy | 38.4 | DualHead-Net |
| Action Localization | NTU RGB+D | Accuracy (CS) | 92 | DualHead-Net |
| Action Localization | NTU RGB+D | Accuracy (CV) | 96.6 | DualHead-Net |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.3 | DualHead-Net |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.2 | DualHead-Net |
| Action Detection | NTU RGB+D 120 | Ensembled Modalities | 4 | DualHead-Net |
| Action Detection | Kinetics-Skeleton dataset | Accuracy | 38.4 | DualHead-Net |
| Action Detection | NTU RGB+D | Accuracy (CS) | 92 | DualHead-Net |
| Action Detection | NTU RGB+D | Accuracy (CV) | 96.6 | DualHead-Net |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.3 | DualHead-Net |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.2 | DualHead-Net |
| 3D Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | DualHead-Net |
| 3D Action Recognition | Kinetics-Skeleton dataset | Accuracy | 38.4 | DualHead-Net |
| 3D Action Recognition | NTU RGB+D | Accuracy (CS) | 92 | DualHead-Net |
| 3D Action Recognition | NTU RGB+D | Accuracy (CV) | 96.6 | DualHead-Net |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.3 | DualHead-Net |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.2 | DualHead-Net |
| Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | DualHead-Net |
| Action Recognition | Kinetics-Skeleton dataset | Accuracy | 38.4 | DualHead-Net |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 92 | DualHead-Net |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 96.6 | DualHead-Net |