Sheng-Wei Li, Zi-Xiang Wei, Wei-Jie Chen, Yi-Hsin Yu, Chih-Yuan Yang, Jane Yung-jen Hsu
Existing zero-shot skeleton-based action recognition methods utilize projection networks to learn a shared latent space of skeleton features and semantic embeddings. The inherent imbalance in action recognition datasets, characterized by variable skeleton sequences yet constant class labels, presents significant challenges for alignment. To address the imbalance, we propose SA-DVAE -- Semantic Alignment via Disentangled Variational Autoencoders, a method that first adopts feature disentanglement to separate skeleton features into two independent parts -- one semantic-related and the other semantic-irrelevant -- to better align skeleton and semantic features. We implement this idea via a pair of modality-specific variational autoencoders coupled with a total correlation penalty. We conduct experiments on three benchmark datasets: NTU RGB+D, NTU RGB+D 120, and PKU-MMD, and our experimental results show that SA-DVAE achieves improved performance over existing methods. The code is available at https://github.com/pha123661/SA-DVAE.
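As a rough illustration of the disentanglement idea in the abstract, the sketch below samples a latent vector from a diagonal-Gaussian encoder output and splits it into a semantic-related part and a semantic-irrelevant part. It is not the authors' implementation: the function names, the latent sizes, and the example values are hypothetical, the encoder itself is replaced by fixed numbers, and the total correlation penalty is omitted. Only the standard VAE building blocks (reparameterization and the KL term against a standard normal prior) are shown.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))


def split_latent(z, semantic_dim):
    """Split a latent vector into (semantic-related, semantic-irrelevant) parts.

    semantic_dim is a hypothetical hyperparameter chosen for illustration.
    """
    return z[:semantic_dim], z[semantic_dim:]


# Hypothetical encoder output for one skeleton feature (made-up values).
rng = random.Random(0)
mu = [0.5, -0.2, 0.0, 1.0]
logvar = [0.0, -1.0, 0.0, 0.5]

z = reparameterize(mu, logvar, rng)
z_sem, z_irr = split_latent(z, semantic_dim=2)
kl = kl_to_standard_normal(mu, logvar)
```

In such a setup, only `z_sem` would be aligned with the text embedding of the class label, while `z_irr` would absorb variation that the label does not explain.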
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | NTU RGB+D 120 | Accuracy (10 unseen classes) | 68.77 | SA-DVAE |
| Video | NTU RGB+D 120 | Accuracy (24 unseen classes) | 46.12 | SA-DVAE |
| Video | NTU RGB+D 120 | Random Split Accuracy | 50.67 | SA-DVAE |
| Video | NTU RGB+D 120 | Random Split Accuracy | 57.16 | SA-DVAE + augmented text |
| Video | PKU-MMD | Random Split Accuracy | 66.54 | SA-DVAE |
| Video | NTU RGB+D | Accuracy (12 unseen classes) | 41.38 | SA-DVAE |
| Video | NTU RGB+D | Accuracy (5 unseen classes) | 82.37 | SA-DVAE |
| Video | NTU RGB+D | Random Split Accuracy | 84.2 | SA-DVAE |
| Video | NTU RGB+D | Random Split Accuracy | 87.61 | SA-DVAE + augmented text |
| Video | NTU RGB+D | Harmonic Mean (12 unseen classes) | 42.56 | SA-DVAE |
| Video | NTU RGB+D | Harmonic Mean (5 unseen classes) | 66.27 | SA-DVAE |
| Video | NTU RGB+D | Random Split Harmonic Mean | 75.27 | SA-DVAE |
| Video | NTU RGB+D | Random Split Harmonic Mean | 75.51 | SA-DVAE + augmented text |
| Video | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 60.42 | SA-DVAE |
| Video | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 44.5 | SA-DVAE |
| Video | NTU RGB+D 120 | Random Split Harmonic Mean | 47.54 | SA-DVAE |
| Video | NTU RGB+D 120 | Random Split Harmonic Mean | 50.72 | SA-DVAE + augmented text |
| Video | PKU-MMD | Random Split Harmonic Mean | 54.72 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (10 unseen classes) | 68.77 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (24 unseen classes) | 46.12 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D 120 | Random Split Accuracy | 50.67 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D 120 | Random Split Accuracy | 57.16 | SA-DVAE + augmented text |
| Temporal Action Localization | PKU-MMD | Random Split Accuracy | 66.54 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D | Accuracy (12 unseen classes) | 41.38 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D | Accuracy (5 unseen classes) | 82.37 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D | Random Split Accuracy | 84.2 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D | Random Split Accuracy | 87.61 | SA-DVAE + augmented text |
| Temporal Action Localization | NTU RGB+D | Harmonic Mean (12 unseen classes) | 42.56 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D | Harmonic Mean (5 unseen classes) | 66.27 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D | Random Split Harmonic Mean | 75.27 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D | Random Split Harmonic Mean | 75.51 | SA-DVAE + augmented text |
| Temporal Action Localization | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 60.42 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 44.5 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D 120 | Random Split Harmonic Mean | 47.54 | SA-DVAE |
| Temporal Action Localization | NTU RGB+D 120 | Random Split Harmonic Mean | 50.72 | SA-DVAE + augmented text |
| Temporal Action Localization | PKU-MMD | Random Split Harmonic Mean | 54.72 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (10 unseen classes) | 68.77 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (24 unseen classes) | 46.12 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D 120 | Random Split Accuracy | 50.67 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D 120 | Random Split Accuracy | 57.16 | SA-DVAE + augmented text |
| Zero-Shot Learning | PKU-MMD | Random Split Accuracy | 66.54 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D | Accuracy (12 unseen classes) | 41.38 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D | Accuracy (5 unseen classes) | 82.37 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D | Random Split Accuracy | 84.2 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D | Random Split Accuracy | 87.61 | SA-DVAE + augmented text |
| Zero-Shot Learning | NTU RGB+D | Harmonic Mean (12 unseen classes) | 42.56 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D | Harmonic Mean (5 unseen classes) | 66.27 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D | Random Split Harmonic Mean | 75.27 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D | Random Split Harmonic Mean | 75.51 | SA-DVAE + augmented text |
| Zero-Shot Learning | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 60.42 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 44.5 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D 120 | Random Split Harmonic Mean | 47.54 | SA-DVAE |
| Zero-Shot Learning | NTU RGB+D 120 | Random Split Harmonic Mean | 50.72 | SA-DVAE + augmented text |
| Zero-Shot Learning | PKU-MMD | Random Split Harmonic Mean | 54.72 | SA-DVAE |
| Activity Recognition | NTU RGB+D 120 | Accuracy (10 unseen classes) | 68.77 | SA-DVAE |
| Activity Recognition | NTU RGB+D 120 | Accuracy (24 unseen classes) | 46.12 | SA-DVAE |
| Activity Recognition | NTU RGB+D 120 | Random Split Accuracy | 50.67 | SA-DVAE |
| Activity Recognition | NTU RGB+D 120 | Random Split Accuracy | 57.16 | SA-DVAE + augmented text |
| Activity Recognition | PKU-MMD | Random Split Accuracy | 66.54 | SA-DVAE |
| Activity Recognition | NTU RGB+D | Accuracy (12 unseen classes) | 41.38 | SA-DVAE |
| Activity Recognition | NTU RGB+D | Accuracy (5 unseen classes) | 82.37 | SA-DVAE |
| Activity Recognition | NTU RGB+D | Random Split Accuracy | 84.2 | SA-DVAE |
| Activity Recognition | NTU RGB+D | Random Split Accuracy | 87.61 | SA-DVAE + augmented text |
| Activity Recognition | NTU RGB+D | Harmonic Mean (12 unseen classes) | 42.56 | SA-DVAE |
| Activity Recognition | NTU RGB+D | Harmonic Mean (5 unseen classes) | 66.27 | SA-DVAE |
| Activity Recognition | NTU RGB+D | Random Split Harmonic Mean | 75.27 | SA-DVAE |
| Activity Recognition | NTU RGB+D | Random Split Harmonic Mean | 75.51 | SA-DVAE + augmented text |
| Activity Recognition | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 60.42 | SA-DVAE |
| Activity Recognition | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 44.5 | SA-DVAE |
| Activity Recognition | NTU RGB+D 120 | Random Split Harmonic Mean | 47.54 | SA-DVAE |
| Activity Recognition | NTU RGB+D 120 | Random Split Harmonic Mean | 50.72 | SA-DVAE + augmented text |
| Activity Recognition | PKU-MMD | Random Split Harmonic Mean | 54.72 | SA-DVAE |
| Action Localization | NTU RGB+D 120 | Accuracy (10 unseen classes) | 68.77 | SA-DVAE |
| Action Localization | NTU RGB+D 120 | Accuracy (24 unseen classes) | 46.12 | SA-DVAE |
| Action Localization | NTU RGB+D 120 | Random Split Accuracy | 50.67 | SA-DVAE |
| Action Localization | NTU RGB+D 120 | Random Split Accuracy | 57.16 | SA-DVAE + augmented text |
| Action Localization | PKU-MMD | Random Split Accuracy | 66.54 | SA-DVAE |
| Action Localization | NTU RGB+D | Accuracy (12 unseen classes) | 41.38 | SA-DVAE |
| Action Localization | NTU RGB+D | Accuracy (5 unseen classes) | 82.37 | SA-DVAE |
| Action Localization | NTU RGB+D | Random Split Accuracy | 84.2 | SA-DVAE |
| Action Localization | NTU RGB+D | Random Split Accuracy | 87.61 | SA-DVAE + augmented text |
| Action Localization | NTU RGB+D | Harmonic Mean (12 unseen classes) | 42.56 | SA-DVAE |
| Action Localization | NTU RGB+D | Harmonic Mean (5 unseen classes) | 66.27 | SA-DVAE |
| Action Localization | NTU RGB+D | Random Split Harmonic Mean | 75.27 | SA-DVAE |
| Action Localization | NTU RGB+D | Random Split Harmonic Mean | 75.51 | SA-DVAE + augmented text |
| Action Localization | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 60.42 | SA-DVAE |
| Action Localization | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 44.5 | SA-DVAE |
| Action Localization | NTU RGB+D 120 | Random Split Harmonic Mean | 47.54 | SA-DVAE |
| Action Localization | NTU RGB+D 120 | Random Split Harmonic Mean | 50.72 | SA-DVAE + augmented text |
| Action Localization | PKU-MMD | Random Split Harmonic Mean | 54.72 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (10 unseen classes) | 68.77 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (24 unseen classes) | 46.12 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D 120 | Random Split Accuracy | 50.67 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D 120 | Random Split Accuracy | 57.16 | SA-DVAE + augmented text |
| 3D Action Recognition | PKU-MMD | Random Split Accuracy | 66.54 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D | Accuracy (12 unseen classes) | 41.38 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D | Accuracy (5 unseen classes) | 82.37 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D | Random Split Accuracy | 84.2 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D | Random Split Accuracy | 87.61 | SA-DVAE + augmented text |
| 3D Action Recognition | NTU RGB+D | Harmonic Mean (12 unseen classes) | 42.56 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D | Harmonic Mean (5 unseen classes) | 66.27 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D | Random Split Harmonic Mean | 75.27 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D | Random Split Harmonic Mean | 75.51 | SA-DVAE + augmented text |
| 3D Action Recognition | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 60.42 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 44.5 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D 120 | Random Split Harmonic Mean | 47.54 | SA-DVAE |
| 3D Action Recognition | NTU RGB+D 120 | Random Split Harmonic Mean | 50.72 | SA-DVAE + augmented text |
| 3D Action Recognition | PKU-MMD | Random Split Harmonic Mean | 54.72 | SA-DVAE |
| Action Recognition | NTU RGB+D 120 | Accuracy (10 unseen classes) | 68.77 | SA-DVAE |
| Action Recognition | NTU RGB+D 120 | Accuracy (24 unseen classes) | 46.12 | SA-DVAE |
| Action Recognition | NTU RGB+D 120 | Random Split Accuracy | 50.67 | SA-DVAE |
| Action Recognition | NTU RGB+D 120 | Random Split Accuracy | 57.16 | SA-DVAE + augmented text |
| Action Recognition | PKU-MMD | Random Split Accuracy | 66.54 | SA-DVAE |
| Action Recognition | NTU RGB+D | Accuracy (12 unseen classes) | 41.38 | SA-DVAE |
| Action Recognition | NTU RGB+D | Accuracy (5 unseen classes) | 82.37 | SA-DVAE |
| Action Recognition | NTU RGB+D | Random Split Accuracy | 84.2 | SA-DVAE |
| Action Recognition | NTU RGB+D | Random Split Accuracy | 87.61 | SA-DVAE + augmented text |
| Action Recognition | NTU RGB+D | Harmonic Mean (12 unseen classes) | 42.56 | SA-DVAE |
| Action Recognition | NTU RGB+D | Harmonic Mean (5 unseen classes) | 66.27 | SA-DVAE |
| Action Recognition | NTU RGB+D | Random Split Harmonic Mean | 75.27 | SA-DVAE |
| Action Recognition | NTU RGB+D | Random Split Harmonic Mean | 75.51 | SA-DVAE + augmented text |
| Action Recognition | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 60.42 | SA-DVAE |
| Action Recognition | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 44.5 | SA-DVAE |
| Action Recognition | NTU RGB+D 120 | Random Split Harmonic Mean | 47.54 | SA-DVAE |
| Action Recognition | NTU RGB+D 120 | Random Split Harmonic Mean | 50.72 | SA-DVAE + augmented text |
| Action Recognition | PKU-MMD | Random Split Harmonic Mean | 54.72 | SA-DVAE |
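
The "Harmonic Mean" rows are not defined on this page. In generalized zero-shot learning the term usually refers to the harmonic mean of the accuracy on seen classes and the accuracy on unseen classes, and the sketch below assumes the table follows that convention. The function name and the input values are hypothetical and are not taken from the results above.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen-class and unseen-class accuracy (same units)."""
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0  # both accuracies are zero; avoid division by zero
    return 2.0 * seen_acc * unseen_acc / total


# Hypothetical accuracies in percent, chosen only for illustration.
h = harmonic_mean(80.0, 60.0)
```

The harmonic mean penalizes a large gap between the two accuracies, so a model that favors seen classes at the expense of unseen ones scores lower than its arithmetic average would suggest.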