Ali Farajzadeh Bavil, Hamed Damirchi, Hamid D. Taghirad
Skeleton-based human action recognition has recently become a highly active research topic owing to the compact, high-level representations that skeleton data provide. Previous studies have shown that modeling joint relationships in both the spatial and temporal dimensions yields information critical to action recognition. However, effectively encoding the global dependencies of joints during spatio-temporal feature extraction remains challenging. In this paper, we introduce Action Capsules, which identify action-related key joints by considering the latent correlations among joints in a skeleton sequence. We show that, during inference, our end-to-end network attends to a set of joints specific to each action and aggregates their encoded spatio-temporal features to recognize the action. Additionally, stacking multiple stages of action capsules improves the network's ability to distinguish similar actions. Consequently, our network outperforms state-of-the-art approaches on the N-UCLA dataset and obtains competitive results on the NTU RGB+D dataset, while requiring significantly less computation as measured in GFLOPs.
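The abstract describes capsules that learn per-joint coupling weights, which act as attention over action-related joints. As a rough illustration, the sketch below implements the standard dynamic-routing step from the original capsule-network literature (Sabour et al., 2017) with NumPy; the paper's actual routing scheme, feature extractor, and capsule dimensions are assumptions here, not taken from the source.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Squash non-linearity: scales each vector's norm into [0, 1)
    # while preserving its direction.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    # u_hat: (num_joints, num_capsules, dim) prediction vectors, one per joint.
    # Returns (num_capsules, dim) output capsules and (num_joints, num_capsules)
    # coupling coefficients, which can be read as per-joint attention weights.
    num_in, num_out, dim = u_hat.shape
    b = np.zeros((num_in, num_out))                           # routing logits
    c = np.full((num_in, num_out), 1.0 / num_out)
    v = np.zeros((num_out, dim))
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over capsules
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum of predictions
        v = squash(s)                                         # (num_out, dim)
        b = b + np.einsum('ijd,jd->ij', u_hat, v)             # agreement update
    return v, c

# Toy example: 25 joints routed to 10 action capsules of dimension 16
# (all sizes are illustrative, not the paper's configuration).
rng = np.random.default_rng(0)
v, c = dynamic_routing(rng.standard_normal((25, 10, 16)))
```

Joints whose predictions agree with an output capsule accumulate larger coupling coefficients over the iterations, which is the mechanism by which a capsule layer can concentrate on a small, action-specific subset of joints.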
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Recognition | N-UCLA | Accuracy | 97.3 | Action Capsules |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 90.0 | Action Capsules |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 96.3 | Action Capsules |