Shengqin Wang, Yongji Zhang, Minghao Zhao, Hong Qi, Kai Wang, Fenglin Wei, Yu Jiang
Skeleton-based action recognition methods are limited by how well they extract semantics from spatio-temporal skeletal maps. Current methods have difficulty combining features from the temporal and spatial graph dimensions effectively, and tend to emphasize one at the expense of the other. In this paper, we propose a Temporal-Channel Aggregation Graph Convolutional Network (TCA-GCN) that learns spatial and temporal topologies dynamically and efficiently aggregates topological features across temporal and channel dimensions for skeleton-based action recognition. We use the Temporal Aggregation module to learn temporal features and the Channel Aggregation module to efficiently combine spatial dynamic channel-wise topological features with temporal dynamic topological features. In addition, we extract multi-scale skeletal features during temporal modeling and fuse them with an attention mechanism. Extensive experiments show that our model outperforms state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.
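
The abstract mentions fusing multi-scale temporal features with an attention mechanism but gives no details. As a rough illustration only, the sketch below shows one generic way such a fusion can work: each temporal scale yields a feature vector and a relevance score, the scores are normalized with a softmax, and the features are combined as a weighted sum. The function names, the fixed input scores, and the toy vectors are assumptions made for this example; in a trained model the scores would be learned, and the authors' actual module may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(branches, scores):
    """Fuse per-scale feature vectors with softmax attention weights.

    branches: list of equal-length feature vectors, one per temporal scale.
    scores:   one unnormalized relevance score per branch (hypothetical input;
              a real model would compute these from the features).
    Returns (fused_vector, attention_weights).
    """
    if len(branches) != len(scores):
        raise ValueError("need exactly one score per branch")
    weights = softmax(scores)
    fused = [0.0] * len(branches[0])
    for weight, features in zip(weights, branches):
        for i, value in enumerate(features):
            fused[i] += weight * value
    return fused, weights


# Toy usage: two temporal scales, two feature channels each.
fused, weights = attention_fuse([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

With equal scores the weights are uniform, so the fused vector is simply the mean of the branches; raising one score shifts the result toward that branch.
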
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.8 | TCA-GCN |
| Video | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.4 | TCA-GCN |
| Video | NTU RGB+D 120 | Ensembled Modalities | 4 | TCA-GCN |
| Video | N-UCLA | Accuracy | 97 | TCA-GCN |
| Video | NTU RGB+D | Accuracy (CS) | 92.8 | TCA-GCN |
| Video | NTU RGB+D | Accuracy (CV) | 97 | TCA-GCN |
| Video | NTU RGB+D | Ensembled Modalities | 4 | TCA-GCN |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.8 | TCA-GCN |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.4 | TCA-GCN |
| Temporal Action Localization | NTU RGB+D 120 | Ensembled Modalities | 4 | TCA-GCN |
| Temporal Action Localization | N-UCLA | Accuracy | 97 | TCA-GCN |
| Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 92.8 | TCA-GCN |
| Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 97 | TCA-GCN |
| Temporal Action Localization | NTU RGB+D | Ensembled Modalities | 4 | TCA-GCN |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.8 | TCA-GCN |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.4 | TCA-GCN |
| Zero-Shot Learning | NTU RGB+D 120 | Ensembled Modalities | 4 | TCA-GCN |
| Zero-Shot Learning | N-UCLA | Accuracy | 97 | TCA-GCN |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 92.8 | TCA-GCN |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 97 | TCA-GCN |
| Zero-Shot Learning | NTU RGB+D | Ensembled Modalities | 4 | TCA-GCN |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.8 | TCA-GCN |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.4 | TCA-GCN |
| Activity Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | TCA-GCN |
| Activity Recognition | N-UCLA | Accuracy | 97 | TCA-GCN |
| Activity Recognition | NTU RGB+D | Accuracy (CS) | 92.8 | TCA-GCN |
| Activity Recognition | NTU RGB+D | Accuracy (CV) | 97 | TCA-GCN |
| Activity Recognition | NTU RGB+D | Ensembled Modalities | 4 | TCA-GCN |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.8 | TCA-GCN |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.4 | TCA-GCN |
| Action Localization | NTU RGB+D 120 | Ensembled Modalities | 4 | TCA-GCN |
| Action Localization | N-UCLA | Accuracy | 97 | TCA-GCN |
| Action Localization | NTU RGB+D | Accuracy (CS) | 92.8 | TCA-GCN |
| Action Localization | NTU RGB+D | Accuracy (CV) | 97 | TCA-GCN |
| Action Localization | NTU RGB+D | Ensembled Modalities | 4 | TCA-GCN |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.8 | TCA-GCN |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.4 | TCA-GCN |
| Action Detection | NTU RGB+D 120 | Ensembled Modalities | 4 | TCA-GCN |
| Action Detection | N-UCLA | Accuracy | 97 | TCA-GCN |
| Action Detection | NTU RGB+D | Accuracy (CS) | 92.8 | TCA-GCN |
| Action Detection | NTU RGB+D | Accuracy (CV) | 97 | TCA-GCN |
| Action Detection | NTU RGB+D | Ensembled Modalities | 4 | TCA-GCN |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.8 | TCA-GCN |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.4 | TCA-GCN |
| 3D Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | TCA-GCN |
| 3D Action Recognition | N-UCLA | Accuracy | 97 | TCA-GCN |
| 3D Action Recognition | NTU RGB+D | Accuracy (CS) | 92.8 | TCA-GCN |
| 3D Action Recognition | NTU RGB+D | Accuracy (CV) | 97 | TCA-GCN |
| 3D Action Recognition | NTU RGB+D | Ensembled Modalities | 4 | TCA-GCN |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.8 | TCA-GCN |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.4 | TCA-GCN |
| Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | TCA-GCN |
| Action Recognition | N-UCLA | Accuracy | 97 | TCA-GCN |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 92.8 | TCA-GCN |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 97 | TCA-GCN |
| Action Recognition | NTU RGB+D | Ensembled Modalities | 4 | TCA-GCN |
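
The table reports four ensembled modalities but does not say which input streams are used or how their outputs are combined. A common approach in skeleton-based recognition is score-level fusion, where each stream produces per-class scores and the (optionally weighted) sum decides the prediction. The sketch below illustrates that idea under those assumptions; the stream names, scores, and uniform weights are hypothetical and are not taken from the paper.

```python
def ensemble_predict(stream_scores, weights=None):
    """Score-level fusion of several streams.

    stream_scores: dict mapping a stream name to its list of per-class scores.
    weights:       optional dict mapping stream name to a fusion weight
                   (defaults to 1.0 for every stream).
    Returns (predicted_class_index, fused_scores).
    """
    names = list(stream_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    num_classes = len(stream_scores[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        for cls, score in enumerate(stream_scores[name]):
            fused[cls] += weights[name] * score
    best = max(range(num_classes), key=fused.__getitem__)
    return best, fused


# Hypothetical scores from four streams on a two-class toy problem.
toy_scores = {
    "joint": [0.6, 0.4],
    "bone": [0.3, 0.7],
    "joint_motion": [0.2, 0.8],
    "bone_motion": [0.55, 0.45],
}
predicted, fused_scores = ensemble_predict(toy_scores)
```

Here two streams individually favor class 0, but the summed scores favor class 1, which is the kind of disagreement an ensemble is meant to resolve.
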