Dongjingdin Liu, Pengpeng Chen, Miao Yao, Yijing Lu, Zijie Cai, Yuxin Tian
Skeleton-based action recognition has achieved remarkable results with the development of graph convolutional networks (GCNs). However, recent works tend to construct complex learning mechanisms with redundant training, and they face a bottleneck on long time series. To solve these problems, we propose the Temporal-Spatio Graph ConvNeXt (TSGCNeXt) to explore an efficient learning mechanism for long temporal skeleton sequences. First, we propose a new graph learning mechanism with a simple structure, Dynamic-Static Separate Multi-graph Convolution (DS-SMG), which aggregates features of multiple independent topological graphs and prevents node information from being ignored during dynamic convolution. Next, we construct a graph convolution training acceleration mechanism that optimizes the back-propagation computation of dynamic graph learning, yielding a 55.08% speed-up. Finally, TSGCNeXt restructures the overall GCN architecture with three spatio-temporal learning modules, efficiently modeling long temporal features. Compared with existing methods on the large-scale datasets NTU RGB+D 60 and 120, TSGCNeXt outperforms them in the single-stream setting. In addition, with an exponential moving average (EMA) model introduced into the multi-stream fusion, TSGCNeXt achieves state-of-the-art results: on the cross-subject and cross-setup benchmarks of NTU RGB+D 120, accuracies reach 90.22% and 91.74%, respectively.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 91.7 | TSGCNeXt |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 90.2 | TSGCNeXt |
| Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | TSGCNeXt |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.3 | TSGCNeXT |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.1 | TSGCNeXT |
| Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | TSGCNeXT |
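
The abstract mentions an EMA model and a multi-stream fusion, and the table reports results ensembled over 4 modalities, but neither describes the procedure. The following is a minimal sketch of the two generic techniques, not the authors' implementation. It assumes that the EMA is a standard exponential moving average of model parameters and that fusion is a weighted sum of per-stream softmax scores. The function names `ema_update` and `fuse_streams`, the `decay` value, and the equal default stream weights are all hypothetical choices made for illustration.

```python
import numpy as np


def ema_update(ema_params, params, decay=0.999):
    """One EMA step per parameter: ema = decay * ema + (1 - decay) * param.

    Assumption: parameters are held in a dict of NumPy arrays; `decay` is
    an illustrative value, not one taken from the paper.
    """
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}


def softmax(logits):
    """Numerically stable softmax over the last (class) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def fuse_streams(stream_logits, weights=None):
    """Fuse per-stream class scores by a weighted sum and return predicted labels.

    stream_logits: list of arrays, each of shape (num_samples, num_classes),
    one per input stream. Assumption: equal weights unless given.
    """
    if weights is None:
        weights = [1.0] * len(stream_logits)
    fused = sum(w * softmax(l) for w, l in zip(weights, stream_logits))
    return fused.argmax(axis=-1)


# Toy usage with two streams, two samples, and two classes.
ema = ema_update({"w": np.zeros(2)}, {"w": np.ones(2)}, decay=0.9)
logits = [
    np.array([[2.0, 0.0], [0.0, 1.0]]),
    np.array([[0.0, 1.0], [0.0, 3.0]]),
]
preds = fuse_streams(logits)
```

In this sketch each stream would be evaluated with its EMA-averaged parameters before the scores are fused. Whether TSGCNeXt fuses softmax scores or raw logits, and how it weights the streams, is not stated in the text above.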