Zhenyue Qin, Yang Liu, Pan Ji, Dongwoo Kim, Lei Wang, Bob McKay, Saeed Anwar, Tom Gedeon
Skeleton sequences are lightweight and compact, and thus are ideal candidates for action recognition on edge devices. Recent skeleton-based action recognition methods extract features from 3D joint coordinates as spatial-temporal cues and feed these representations to a graph neural network for feature fusion to boost recognition performance. The use of first- and second-order features, i.e., joint and bone representations, has led to high accuracy. Nonetheless, many models are still confused by actions that have similar motion trajectories. To address this issue, we propose fusing higher-order features in the form of angular encoding into modern architectures to robustly capture the relationships between joints and body parts. This simple fusion with popular spatial-temporal graph neural networks achieves new state-of-the-art accuracy on two large benchmarks, NTU60 and NTU120, while employing fewer parameters and reduced run time. Our source code is publicly available at: https://github.com/ZhenyueQin/Angular-Skeleton-Encoding.
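
To make the distinction between feature orders concrete, here is a minimal sketch of a first-order (joint), second-order (bone), and higher-order (angular) feature for a single frame. The abstract does not specify the exact angular encoding, so this assumes one plausible form: the cosine of the angle formed at a joint by the vectors to two other joints. The function names, joint names, and toy coordinates are hypothetical and are not taken from the authors' repository.

```python
import math

def bone(joints, child, parent):
    """Second-order feature: vector pointing from `parent` to `child`."""
    return tuple(c - p for c, p in zip(joints[child], joints[parent]))

def angular_feature(joints, center, end_a, end_b, eps=1e-8):
    """Assumed higher-order feature: cosine of the angle at `center`
    between the vectors center->end_a and center->end_b."""
    u = bone(joints, end_a, center)
    v = bone(joints, end_b, center)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm < eps:
        # Degenerate case (coincident joints): fall back to 0.0.
        return 0.0
    return dot / norm

# Toy frame with hypothetical joint names; coordinates are (x, y, z).
# The joint coordinates themselves are the first-order features.
bent_arm = {
    "shoulder": (0.0, 0.0, 0.0),
    "elbow": (1.0, 0.0, 0.0),
    "wrist": (1.0, 1.0, 0.0),
}
straight_arm = dict(bent_arm, wrist=(2.0, 0.0, 0.0))

upper_arm = bone(bent_arm, "elbow", "shoulder")
bent_cos = angular_feature(bent_arm, "elbow", "shoulder", "wrist")
straight_cos = angular_feature(straight_arm, "elbow", "shoulder", "wrist")
```

In this toy frame the upper-arm bone is identical for both poses, while the elbow angle separates them, which illustrates why an angular cue might help with actions whose joint trajectories look alike.
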
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | AngNet-JA + BA + JBA + VJBA |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 91.7 | AngNet-JA + BA + JBA + VJBA |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 96.4 | AngNet-JA + BA + JBA + VJBA |