Hongda Liu, Yunfan Liu, Min Ren, Hao Wang, Yunlong Wang, Zhenan Sun
In skeleton-based action recognition, a key challenge is distinguishing actions whose joint trajectories are similar, because skeletal representations lack image-level detail. Since telling similar actions apart depends on subtle motion details in specific body parts, our approach focuses on the fine-grained motion of local skeleton components. To this end, we introduce ProtoGCN, a Graph Convolutional Network (GCN)-based model that decomposes the dynamics of an entire skeleton sequence into a combination of learnable prototypes representing the core motion patterns of action units. By contrasting the prototype-based reconstructions, ProtoGCN effectively identifies and enhances the discriminative representations of similar actions. Without bells and whistles, ProtoGCN achieves state-of-the-art performance on multiple benchmark datasets, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, and FineGYM, demonstrating the effectiveness of the proposed method. The code is available at https://github.com/firework8/ProtoGCN.
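
To make the idea of "a combination of learnable prototypes" concrete, the following is a minimal sketch, not the ProtoGCN implementation. It assumes a feature vector is compared to each prototype by dot product, the similarities are turned into softmax weights, and the feature is rebuilt as the weighted sum of prototypes. The function names (`softmax`, `reconstruct`, `cosine`), the sizes, and the randomly initialised prototypes are all hypothetical; in a trained model the prototypes would be learned parameters.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruct(feature, prototypes):
    """Rebuild `feature` as a convex combination of prototype vectors.

    Assumption for illustration: dot-product similarity plus softmax
    gives the mixing weights.
    """
    scores = [sum(f * p for f, p in zip(feature, proto)) for proto in prototypes]
    weights = softmax(scores)
    dim = len(feature)
    recon = [
        sum(w * proto[d] for w, proto in zip(weights, prototypes))
        for d in range(dim)
    ]
    return weights, recon


def cosine(a, b):
    """Cosine similarity, used here to compare two reconstructions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


random.seed(0)
# Hypothetical setup: 4 prototypes of dimension 3 (stand-ins for learned parameters).
prototypes = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]

# Two made-up features standing in for two similar actions.
feature_a = [0.5, -1.0, 2.0]
feature_b = [0.4, -0.9, 1.5]

weights_a, recon_a = reconstruct(feature_a, prototypes)
weights_b, recon_b = reconstruct(feature_b, prototypes)

# A contrastive objective would act on a similarity such as this one.
similarity = cosine(recon_a, recon_b)
```

Here `similarity` is the kind of quantity a contrastive loss could push down for samples of different classes and up for samples of the same class; how ProtoGCN actually defines that loss is not stated in the abstract.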
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | NTU RGB+D 120 | Accuracy (Cross-Setup) | 92.2 | ProtoGCN |
| Video | NTU RGB+D 120 | Accuracy (Cross-Subject) | 90.9 | ProtoGCN |
| Video | NTU RGB+D 120 | Ensembled Modalities | 6 | ProtoGCN |
| Video | Kinetics-Skeleton dataset | Accuracy | 51.9 | ProtoGCN |
| Video | NTU RGB+D | Accuracy (CS) | 93.8 | ProtoGCN |
| Video | NTU RGB+D | Accuracy (CV) | 97.8 | ProtoGCN |
| Video | NTU RGB+D | Ensembled Modalities | 6 | ProtoGCN |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 92.2 | ProtoGCN |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 90.9 | ProtoGCN |
| Temporal Action Localization | NTU RGB+D 120 | Ensembled Modalities | 6 | ProtoGCN |
| Temporal Action Localization | Kinetics-Skeleton dataset | Accuracy | 51.9 | ProtoGCN |
| Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 93.8 | ProtoGCN |
| Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 97.8 | ProtoGCN |
| Temporal Action Localization | NTU RGB+D | Ensembled Modalities | 6 | ProtoGCN |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Setup) | 92.2 | ProtoGCN |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Subject) | 90.9 | ProtoGCN |
| Zero-Shot Learning | NTU RGB+D 120 | Ensembled Modalities | 6 | ProtoGCN |
| Zero-Shot Learning | Kinetics-Skeleton dataset | Accuracy | 51.9 | ProtoGCN |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 93.8 | ProtoGCN |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 97.8 | ProtoGCN |
| Zero-Shot Learning | NTU RGB+D | Ensembled Modalities | 6 | ProtoGCN |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 92.2 | ProtoGCN |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 90.9 | ProtoGCN |
| Activity Recognition | NTU RGB+D 120 | Ensembled Modalities | 6 | ProtoGCN |
| Activity Recognition | Kinetics-Skeleton dataset | Accuracy | 51.9 | ProtoGCN |
| Activity Recognition | NTU RGB+D | Accuracy (CS) | 93.8 | ProtoGCN |
| Activity Recognition | NTU RGB+D | Accuracy (CV) | 97.8 | ProtoGCN |
| Activity Recognition | NTU RGB+D | Ensembled Modalities | 6 | ProtoGCN |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 92.2 | ProtoGCN |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 90.9 | ProtoGCN |
| Action Localization | NTU RGB+D 120 | Ensembled Modalities | 6 | ProtoGCN |
| Action Localization | Kinetics-Skeleton dataset | Accuracy | 51.9 | ProtoGCN |
| Action Localization | NTU RGB+D | Accuracy (CS) | 93.8 | ProtoGCN |
| Action Localization | NTU RGB+D | Accuracy (CV) | 97.8 | ProtoGCN |
| Action Localization | NTU RGB+D | Ensembled Modalities | 6 | ProtoGCN |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Setup) | 92.2 | ProtoGCN |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Subject) | 90.9 | ProtoGCN |
| Action Detection | NTU RGB+D 120 | Ensembled Modalities | 6 | ProtoGCN |
| Action Detection | Kinetics-Skeleton dataset | Accuracy | 51.9 | ProtoGCN |
| Action Detection | NTU RGB+D | Accuracy (CS) | 93.8 | ProtoGCN |
| Action Detection | NTU RGB+D | Accuracy (CV) | 97.8 | ProtoGCN |
| Action Detection | NTU RGB+D | Ensembled Modalities | 6 | ProtoGCN |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 92.2 | ProtoGCN |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 90.9 | ProtoGCN |
| 3D Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 6 | ProtoGCN |
| 3D Action Recognition | Kinetics-Skeleton dataset | Accuracy | 51.9 | ProtoGCN |
| 3D Action Recognition | NTU RGB+D | Accuracy (CS) | 93.8 | ProtoGCN |
| 3D Action Recognition | NTU RGB+D | Accuracy (CV) | 97.8 | ProtoGCN |
| 3D Action Recognition | NTU RGB+D | Ensembled Modalities | 6 | ProtoGCN |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 92.2 | ProtoGCN |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 90.9 | ProtoGCN |
| Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 6 | ProtoGCN |
| Action Recognition | Kinetics-Skeleton dataset | Accuracy | 51.9 | ProtoGCN |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 93.8 | ProtoGCN |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 97.8 | ProtoGCN |
| Action Recognition | NTU RGB+D | Ensembled Modalities | 6 | ProtoGCN |
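
The table lists six ensembled modalities but does not say how they are combined. One common approach in skeleton-based recognition, assumed here purely for illustration, is score-level fusion: each modality stream produces per-class scores, and the final prediction is the argmax of their (optionally weighted) sum. The function names and the score values below are hypothetical.

```python
def fuse_scores(stream_scores, weights=None):
    """Weighted sum of per-class scores across streams.

    `stream_scores` is a list with one per-class score list per stream.
    With `weights=None`, every stream contributes equally.
    """
    n = len(stream_scores)
    if weights is None:
        weights = [1.0 / n] * n
    num_classes = len(stream_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, stream_scores))
        for c in range(num_classes)
    ]


def predict(stream_scores, weights=None):
    """Index of the class with the highest fused score."""
    fused = fuse_scores(stream_scores, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up softmax outputs from six streams over three classes.
streams = [
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.8, 0.1],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
]

fused = fuse_scores(streams)
label = predict(streams)
```

In this toy input two streams favour class 0 on their own, yet the fused scores favour class 1, which is the usual motivation for ensembling streams that make different errors.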