Dong Yang, Monica Mengqi Li, Hong Fu, Jicong Fan, Zhao Zhang, Howard Leung
Combining skeleton structure with graph convolutional networks has achieved remarkable performance in human action recognition. However, current research focuses on designing basic graphs to represent skeleton data, so the resulting embedding features capture only basic topological information and cannot provide more systematic perspectives on the skeleton data. In this paper, we overcome this limitation with a novel framework that unifies 15 graph embedding features within a graph convolutional network for human action recognition, aiming to make the best use of graph information to distinguish key joints, bones, and body parts in human actions, rather than relying on a single feature or domain. We also thoroughly investigate how to find the best graph features of the skeleton structure for improving action recognition. In addition, the topological information of the skeleton sequence is explored to further enhance performance in a multi-stream framework, and the unified graph features are extracted adaptively during training, which yields further improvements. Our model is validated on three large-scale datasets, namely NTU-RGB+D, Kinetics, and SYSU-3D, and outperforms state-of-the-art methods. Overall, our work unifies graph embedding features to promote systematic research on human action recognition.
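As background for the graph-convolutional backbone the abstract describes, the sketch below shows a single spatial graph-convolution layer applied to a skeleton adjacency matrix. This is a generic GCN layer for illustration only, not the authors' CGCN implementation; the joint count, feature sizes, and function names are assumptions.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalize A + I (self-loops added), as in
    standard graph-convolution layers."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(X, A_norm, W):
    """One spatial graph convolution: aggregate features from each
    joint's skeletal neighbors, then apply a learned map and ReLU."""
    return np.maximum(A_norm @ X @ W, 0.0)

# Toy skeleton: 5 joints in a chain (e.g. one limb), bones as edges.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_norm = normalize_adjacency(A)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # per-joint input features (3-D coords)
W = rng.standard_normal((3, 8))   # layer weights (random here, learned in practice)
out = gcn_layer(X, A_norm, W)
print(out.shape)  # (5, 8): one 8-dim embedding per joint
```

In a full model, layers like this are stacked and interleaved with temporal convolutions over frames; the paper's contribution is feeding such layers a unified set of graph embedding features rather than a single fixed skeleton graph.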
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Recognition | Kinetics-Skeleton | Accuracy (%) | 37.5 | CGCN |
| Action Recognition | NTU RGB+D | Accuracy, cross-subject (%) | 90.3 | CGCN |
| Action Recognition | NTU RGB+D | Accuracy, cross-view (%) | 96.4 | CGCN |