Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu
Graph convolutional networks (GCNs), which generalize CNNs to more generic non-Euclidean structures, have achieved remarkable performance in skeleton-based action recognition. However, several issues remain in previous GCN-based models. First, the topology of the graph is set heuristically and fixed across all model layers and input data. This may not suit the hierarchy of the GCN model or the diversity of the data in action recognition tasks. Second, the second-order information of the skeleton data, i.e., the length and orientation of the bones, is rarely investigated, although it is naturally more informative and discriminative for human action recognition. In this work, we propose a novel multi-stream attention-enhanced adaptive graph convolutional neural network (MS-AAGCN) for skeleton-based action recognition. The graph topology in our model can be learned either uniformly or individually from the input data in an end-to-end manner. This data-driven approach makes graph construction more flexible and lets the model generalize to varied data samples. In addition, the proposed adaptive graph convolutional layer is enhanced by a spatial-temporal-channel attention module, which helps the model pay more attention to important joints, frames, and features. Moreover, the information of both the joints and the bones, together with their motion information, is modeled simultaneously in a multi-stream framework, which notably improves recognition accuracy. Extensive experiments on two large-scale datasets, NTU-RGBD and Kinetics-Skeleton, demonstrate that our model exceeds the state of the art by a significant margin.
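
The "second-order information" and "motion information" mentioned above can be illustrated with a minimal sketch. It assumes a bone is the vector from a parent joint to its child joint and that motion is the difference between consecutive frames. The 4-joint layout in `BONES` and all function names are hypothetical, chosen only for illustration; they are not the NTU-RGBD or Kinetics-Skeleton joint definitions and not the authors' code.

```python
import math

# Hypothetical skeleton: (child, parent) joint-index pairs. This is an
# illustrative layout, not the one used by NTU-RGBD or Kinetics-Skeleton.
BONES = [(1, 0), (2, 1), (3, 1)]


def bone_vectors(joints, bones=BONES):
    """Return one vector per bone, pointing from the parent joint to the child."""
    return [
        tuple(c - p for c, p in zip(joints[child], joints[parent]))
        for child, parent in bones
    ]


def bone_lengths(vectors):
    """Euclidean length of each bone vector."""
    return [math.sqrt(sum(x * x for x in v)) for v in vectors]


def motion(frames):
    """Per-joint (or per-bone) differences between consecutive frames."""
    return [
        [tuple(b - a for a, b in zip(prev, cur)) for prev, cur in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]


# Two frames of a 4-joint skeleton; the second frame is shifted along x.
frame0 = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
frame1 = [(x + 1.0, y, z) for x, y, z in frame0]

bones0 = bone_vectors(frame0)
lengths0 = bone_lengths(bones0)
joint_motion = motion([frame0, frame1])
bone_motion = motion([bone_vectors(frame0), bone_vectors(frame1)])
```

In this toy case a pure translation moves every joint but leaves the bone vectors unchanged, so the bone-motion stream is all zeros while the joint-motion stream is not. This is one way to see that the streams carry different information.
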
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Recognition | Kinetics-Skeleton dataset | Accuracy | 37.8 | MS-AAGCN |
| Action Recognition | Kinetics-Skeleton dataset | Accuracy | 37.4 | JB-AAGCN |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 90.0 | MS-AAGCN |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 96.2 | MS-AAGCN |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 89.4 | JB-AAGCN |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 96.0 | JB-AAGCN |
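
The abstract describes modeling joints, bones, and their motion in a multi-stream framework. One common way to combine such streams is late fusion: each stream produces class scores, and the per-class softmax scores are summed before taking the arg max. The sketch below assumes that scheme. The function names, the optional `weights` parameter, and the example logits are illustrative assumptions, not values or code from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(stream_logits, weights=None):
    """Weighted sum of per-stream softmax scores.

    stream_logits: one list of class logits per stream (e.g. joint, bone).
    weights: optional per-stream weights; defaults to equal weighting.
    Returns (fused_scores, predicted_class_index).
    """
    if weights is None:
        weights = [1.0] * len(stream_logits)
    num_classes = len(stream_logits[0])
    fused = [0.0] * num_classes
    for w, logits in zip(weights, stream_logits):
        for k, p in enumerate(softmax(logits)):
            fused[k] += w * p
    predicted = max(range(num_classes), key=fused.__getitem__)
    return fused, predicted


# Made-up logits for two streams over three classes.
streams = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
fused, predicted = fuse_streams(streams)
```

Here the first stream alone would pick class 0, but the second stream is more confident about class 1, so the fused prediction is class 1. Summing probabilities rather than raw logits keeps a stream with a larger logit scale from dominating the result.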