Pranay Gupta, Divyanshu Sharma, Ravi Kiran Sarvadevabhatla
We introduce SynSE, a novel syntactically guided generative approach for Zero-Shot Learning (ZSL). Our end-to-end approach learns progressively refined generative embedding spaces constrained within and across the involved modalities (visual, language). The inter-modal constraints are defined between the action sequence embedding and the embeddings of Part-of-Speech (PoS) tagged words in the corresponding action description. We deploy SynSE for the task of skeleton-based action sequence recognition. Our design choices enable SynSE to generalize compositionally, i.e., to recognize sequences whose action descriptions contain words not encountered during training. We also extend our approach to the more challenging Generalized Zero-Shot Learning (GZSL) problem via a confidence-based gating mechanism. We are the first to present zero-shot skeleton action recognition results on the large-scale NTU-60 and NTU-120 skeleton action datasets with multiple splits. Our results demonstrate SynSE's state-of-the-art performance in both ZSL and GZSL settings compared to strong baselines on the NTU-60 and NTU-120 datasets. The code and pretrained models are available at https://github.com/skelemoa/synse-zsl
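
To make the idea of confidence-based gating for GZSL concrete, here is a minimal sketch. It is a simplified, fixed-threshold stand-in and should not be read as the gating model used in SynSE: the function name `gated_predict`, the `threshold` parameter, and the example labels are all hypothetical. It assumes one classifier produces probabilities over seen classes and a separate zero-shot model produces probabilities over unseen classes.

```python
from typing import Sequence


def gated_predict(
    seen_probs: Sequence[float],
    unseen_probs: Sequence[float],
    seen_labels: Sequence[str],
    unseen_labels: Sequence[str],
    threshold: float = 0.5,  # hypothetical value; would be tuned on validation data
) -> str:
    """Route a sample to the seen-class or unseen-class expert.

    Assumption: a high maximum probability from the seen-class classifier
    suggests the sample belongs to a seen class; otherwise the zero-shot
    model decides among the unseen classes.
    """
    best_seen = max(range(len(seen_probs)), key=lambda i: seen_probs[i])
    if seen_probs[best_seen] >= threshold:
        return seen_labels[best_seen]
    best_unseen = max(range(len(unseen_probs)), key=lambda i: unseen_probs[i])
    return unseen_labels[best_unseen]
```

With a threshold of 0.6, for example, a sample whose seen-class probabilities are `[0.45, 0.55]` would be handed to the zero-shot model, while one with `[0.9, 0.1]` would be labelled with the top seen class.
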
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | NTU RGB+D 120 | Accuracy (10 unseen classes) | 62.69 | SynSE |
| Video | NTU RGB+D 120 | Accuracy (24 unseen classes) | 38.7 | SynSE |
| Video | PKU-MMD | Random Split Accuracy | 53.85 | SynSE |
| Video | NTU RGB+D | Accuracy (12 unseen classes) | 33.3 | SynSE |
| Video | NTU RGB+D | Accuracy (5 unseen classes) | 75.81 | SynSE |
| Video | NTU RGB+D | Random Split Accuracy | 64.19 | SynSE |
| Video | NTU RGB+D | Harmonic Mean (12 unseen classes) | 36.33 | SynSE |
| Video | NTU RGB+D | Harmonic Mean (5 unseen classes) | 59.02 | SynSE |
| Video | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 54.94 | SynSE |
| Video | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 41.04 | SynSE |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (10 unseen classes) | 62.69 | SynSE |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (24 unseen classes) | 38.7 | SynSE |
| Temporal Action Localization | PKU-MMD | Random Split Accuracy | 53.85 | SynSE |
| Temporal Action Localization | NTU RGB+D | Accuracy (12 unseen classes) | 33.3 | SynSE |
| Temporal Action Localization | NTU RGB+D | Accuracy (5 unseen classes) | 75.81 | SynSE |
| Temporal Action Localization | NTU RGB+D | Random Split Accuracy | 64.19 | SynSE |
| Temporal Action Localization | NTU RGB+D | Harmonic Mean (12 unseen classes) | 36.33 | SynSE |
| Temporal Action Localization | NTU RGB+D | Harmonic Mean (5 unseen classes) | 59.02 | SynSE |
| Temporal Action Localization | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 54.94 | SynSE |
| Temporal Action Localization | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 41.04 | SynSE |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (10 unseen classes) | 62.69 | SynSE |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (24 unseen classes) | 38.7 | SynSE |
| Zero-Shot Learning | PKU-MMD | Random Split Accuracy | 53.85 | SynSE |
| Zero-Shot Learning | NTU RGB+D | Accuracy (12 unseen classes) | 33.3 | SynSE |
| Zero-Shot Learning | NTU RGB+D | Accuracy (5 unseen classes) | 75.81 | SynSE |
| Zero-Shot Learning | NTU RGB+D | Random Split Accuracy | 64.19 | SynSE |
| Zero-Shot Learning | NTU RGB+D | Harmonic Mean (12 unseen classes) | 36.33 | SynSE |
| Zero-Shot Learning | NTU RGB+D | Harmonic Mean (5 unseen classes) | 59.02 | SynSE |
| Zero-Shot Learning | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 54.94 | SynSE |
| Zero-Shot Learning | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 41.04 | SynSE |
| Activity Recognition | NTU RGB+D 120 | Accuracy (10 unseen classes) | 62.69 | SynSE |
| Activity Recognition | NTU RGB+D 120 | Accuracy (24 unseen classes) | 38.7 | SynSE |
| Activity Recognition | PKU-MMD | Random Split Accuracy | 53.85 | SynSE |
| Activity Recognition | NTU RGB+D | Accuracy (12 unseen classes) | 33.3 | SynSE |
| Activity Recognition | NTU RGB+D | Accuracy (5 unseen classes) | 75.81 | SynSE |
| Activity Recognition | NTU RGB+D | Random Split Accuracy | 64.19 | SynSE |
| Activity Recognition | NTU RGB+D | Harmonic Mean (12 unseen classes) | 36.33 | SynSE |
| Activity Recognition | NTU RGB+D | Harmonic Mean (5 unseen classes) | 59.02 | SynSE |
| Activity Recognition | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 54.94 | SynSE |
| Activity Recognition | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 41.04 | SynSE |
| Action Localization | NTU RGB+D 120 | Accuracy (10 unseen classes) | 62.69 | SynSE |
| Action Localization | NTU RGB+D 120 | Accuracy (24 unseen classes) | 38.7 | SynSE |
| Action Localization | PKU-MMD | Random Split Accuracy | 53.85 | SynSE |
| Action Localization | NTU RGB+D | Accuracy (12 unseen classes) | 33.3 | SynSE |
| Action Localization | NTU RGB+D | Accuracy (5 unseen classes) | 75.81 | SynSE |
| Action Localization | NTU RGB+D | Random Split Accuracy | 64.19 | SynSE |
| Action Localization | NTU RGB+D | Harmonic Mean (12 unseen classes) | 36.33 | SynSE |
| Action Localization | NTU RGB+D | Harmonic Mean (5 unseen classes) | 59.02 | SynSE |
| Action Localization | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 54.94 | SynSE |
| Action Localization | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 41.04 | SynSE |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (10 unseen classes) | 62.69 | SynSE |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (24 unseen classes) | 38.7 | SynSE |
| 3D Action Recognition | PKU-MMD | Random Split Accuracy | 53.85 | SynSE |
| 3D Action Recognition | NTU RGB+D | Accuracy (12 unseen classes) | 33.3 | SynSE |
| 3D Action Recognition | NTU RGB+D | Accuracy (5 unseen classes) | 75.81 | SynSE |
| 3D Action Recognition | NTU RGB+D | Random Split Accuracy | 64.19 | SynSE |
| 3D Action Recognition | NTU RGB+D | Harmonic Mean (12 unseen classes) | 36.33 | SynSE |
| 3D Action Recognition | NTU RGB+D | Harmonic Mean (5 unseen classes) | 59.02 | SynSE |
| 3D Action Recognition | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 54.94 | SynSE |
| 3D Action Recognition | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 41.04 | SynSE |
| Action Recognition | NTU RGB+D 120 | Accuracy (10 unseen classes) | 62.69 | SynSE |
| Action Recognition | NTU RGB+D 120 | Accuracy (24 unseen classes) | 38.7 | SynSE |
| Action Recognition | PKU-MMD | Random Split Accuracy | 53.85 | SynSE |
| Action Recognition | NTU RGB+D | Accuracy (12 unseen classes) | 33.3 | SynSE |
| Action Recognition | NTU RGB+D | Accuracy (5 unseen classes) | 75.81 | SynSE |
| Action Recognition | NTU RGB+D | Random Split Accuracy | 64.19 | SynSE |
| Action Recognition | NTU RGB+D | Harmonic Mean (12 unseen classes) | 36.33 | SynSE |
| Action Recognition | NTU RGB+D | Harmonic Mean (5 unseen classes) | 59.02 | SynSE |
| Action Recognition | NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 54.94 | SynSE |
| Action Recognition | NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 41.04 | SynSE |
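
The "Harmonic Mean" rows above correspond to the GZSL setting. By the usual GZSL convention, which is assumed here, the harmonic mean combines the accuracy on seen classes with the accuracy on unseen classes, so that a model cannot score well by favouring one group. The sketch below illustrates that formula; the function name and the input values are hypothetical and are not results from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy.

    H = 2 * s * u / (s + u); defined as 0 when both accuracies are 0.
    """
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / total


# Hypothetical accuracies (in percent), chosen only to illustrate the formula.
balanced = harmonic_mean(50.0, 50.0)  # equal accuracies give the same value back
skewed = harmonic_mean(60.0, 30.0)    # pulled toward the lower of the two
```

Because the harmonic mean is dominated by the smaller of its two inputs, a large gap between seen and unseen accuracy lowers the score more than an arithmetic mean would.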