Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, Weiming Hu
Graph convolutional networks (GCNs) are widely used in skeleton-based action recognition and have achieved remarkable results. In GCNs, graph topology dominates feature aggregation and is therefore key to extracting representative features. In this work, we propose a novel Channel-wise Topology Refinement Graph Convolution (CTR-GC) that dynamically learns different topologies and effectively aggregates joint features in different channels for skeleton-based action recognition. The proposed CTR-GC models channel-wise topologies by learning a shared topology as a generic prior for all channels and refining it with channel-specific correlations for each channel. Our refinement method introduces few extra parameters and significantly reduces the difficulty of modeling channel-wise topologies. Furthermore, by reformulating graph convolutions into a unified form, we find that CTR-GC relaxes the strict constraints of graph convolutions, leading to stronger representation capability. Combining CTR-GC with temporal modeling modules, we develop a powerful graph convolutional network named CTR-GCN, which notably outperforms state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.
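The idea of refining a shared topology with channel-specific correlations can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the abstract does not say how the correlations are computed, so the `tanh` of pairwise feature differences, the scalar weight `alpha`, and the function names `refine_topology` and `aggregate` are all assumptions made for illustration. The learned linear transforms a real layer would apply are omitted.

```python
import math


def refine_topology(shared, feats, alpha=0.5):
    """Build one refined N x N topology per channel (illustrative only).

    shared: N x N nested list, the topology shared by all channels.
    feats:  N x C nested list, one feature vector per joint.
    alpha:  assumed scalar weight on the channel-specific correlation.
    """
    n = len(shared)
    num_channels = len(feats[0])
    refined = []
    for ch in range(num_channels):
        # Assumed correlation: tanh of the pairwise feature difference
        # between joints i and j in this channel.
        topo = [
            [shared[i][j] + alpha * math.tanh(feats[i][ch] - feats[j][ch])
             for j in range(n)]
            for i in range(n)
        ]
        refined.append(topo)
    return refined


def aggregate(refined, feats):
    """Aggregate joint features, using each channel's own topology."""
    n = len(feats)
    num_channels = len(feats[0])
    out = [[0.0] * num_channels for _ in range(n)]
    for ch in range(num_channels):
        for i in range(n):
            out[i][ch] = sum(refined[ch][i][j] * feats[j][ch] for j in range(n))
    return out


# Two joints, two channels, identity matrix as the shared topology.
shared = [[1.0, 0.0], [0.0, 1.0]]
feats = [[1.0, 2.0], [3.0, 2.0]]
per_channel = refine_topology(shared, feats, alpha=0.5)
output = aggregate(per_channel, feats)
```

With `alpha=0` every channel falls back to the shared topology, which corresponds to an ordinary graph convolution with a single adjacency matrix. A nonzero `alpha` lets each channel deviate from that prior, which is the extra flexibility the abstract attributes to CTR-GC.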
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.6 | CTR-GCN |
| Video | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.9 | CTR-GCN |
| Video | NTU RGB+D 120 | Ensembled Modalities | 4 | CTR-GCN |
| Video | N-UCLA | Accuracy | 96.5 | CTR-GCN |
| Video | NTU RGB+D | Accuracy (CS) | 92.4 | CTR-GCN |
| Video | NTU RGB+D | Accuracy (CV) | 96.8 | CTR-GCN |
| Video | NTU RGB+D | Ensembled Modalities | 4 | CTR-GCN |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.6 | CTR-GCN |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.9 | CTR-GCN |
| Temporal Action Localization | NTU RGB+D 120 | Ensembled Modalities | 4 | CTR-GCN |
| Temporal Action Localization | N-UCLA | Accuracy | 96.5 | CTR-GCN |
| Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 92.4 | CTR-GCN |
| Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 96.8 | CTR-GCN |
| Temporal Action Localization | NTU RGB+D | Ensembled Modalities | 4 | CTR-GCN |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.6 | CTR-GCN |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.9 | CTR-GCN |
| Zero-Shot Learning | NTU RGB+D 120 | Ensembled Modalities | 4 | CTR-GCN |
| Zero-Shot Learning | N-UCLA | Accuracy | 96.5 | CTR-GCN |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 92.4 | CTR-GCN |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 96.8 | CTR-GCN |
| Zero-Shot Learning | NTU RGB+D | Ensembled Modalities | 4 | CTR-GCN |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.6 | CTR-GCN |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.9 | CTR-GCN |
| Activity Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | CTR-GCN |
| Activity Recognition | N-UCLA | Accuracy | 96.5 | CTR-GCN |
| Activity Recognition | NTU RGB+D | Accuracy (CS) | 92.4 | CTR-GCN |
| Activity Recognition | NTU RGB+D | Accuracy (CV) | 96.8 | CTR-GCN |
| Activity Recognition | NTU RGB+D | Ensembled Modalities | 4 | CTR-GCN |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.6 | CTR-GCN |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.9 | CTR-GCN |
| Action Localization | NTU RGB+D 120 | Ensembled Modalities | 4 | CTR-GCN |
| Action Localization | N-UCLA | Accuracy | 96.5 | CTR-GCN |
| Action Localization | NTU RGB+D | Accuracy (CS) | 92.4 | CTR-GCN |
| Action Localization | NTU RGB+D | Accuracy (CV) | 96.8 | CTR-GCN |
| Action Localization | NTU RGB+D | Ensembled Modalities | 4 | CTR-GCN |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.6 | CTR-GCN |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.9 | CTR-GCN |
| Action Detection | NTU RGB+D 120 | Ensembled Modalities | 4 | CTR-GCN |
| Action Detection | N-UCLA | Accuracy | 96.5 | CTR-GCN |
| Action Detection | NTU RGB+D | Accuracy (CS) | 92.4 | CTR-GCN |
| Action Detection | NTU RGB+D | Accuracy (CV) | 96.8 | CTR-GCN |
| Action Detection | NTU RGB+D | Ensembled Modalities | 4 | CTR-GCN |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.6 | CTR-GCN |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.9 | CTR-GCN |
| 3D Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | CTR-GCN |
| 3D Action Recognition | N-UCLA | Accuracy | 96.5 | CTR-GCN |
| 3D Action Recognition | NTU RGB+D | Accuracy (CS) | 92.4 | CTR-GCN |
| 3D Action Recognition | NTU RGB+D | Accuracy (CV) | 96.8 | CTR-GCN |
| 3D Action Recognition | NTU RGB+D | Ensembled Modalities | 4 | CTR-GCN |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.6 | CTR-GCN |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.9 | CTR-GCN |
| Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | CTR-GCN |
| Action Recognition | N-UCLA | Accuracy | 96.5 | CTR-GCN |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 92.4 | CTR-GCN |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 96.8 | CTR-GCN |
| Action Recognition | NTU RGB+D | Ensembled Modalities | 4 | CTR-GCN |