Jeonghyeok Do, Munchurl Kim
We present the first diffusion-based framework for zero-shot skeleton-based action recognition. In this setting, aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. Previous methods focus on direct alignment between the skeleton and text latent spaces, but the modality gap between these spaces hinders robust generalization. Motivated by the remarkable performance of text-to-image diffusion models, we leverage their ability to align different modalities, primarily by focusing on the training process of reverse diffusion rather than on their generative power. Based on this, we design our framework as a Triplet Diffusion for Skeleton-Text Matching (TDSM) method, which aligns skeleton features with text prompts through reverse diffusion, embedding the prompts into a unified skeleton-text latent space to achieve robust matching. To enhance discriminative power, we introduce a novel triplet diffusion (TD) loss that encourages TDSM to favor correct skeleton-text matches while pushing apart incorrect ones. TDSM outperforms very recent state-of-the-art methods by large margins of 2.36 to 13.05 percentage points, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.
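The idea behind a triplet-style loss on diffusion outputs can be illustrated with a minimal sketch. The abstract does not give the TD loss formula, so everything here is an assumption made for illustration: the function name `triplet_diffusion_loss`, the hinge form, the `margin` parameter, and the use of mean squared denoising error are hypothetical and may differ from the authors' formulation. The sketch assumes a denoiser that predicts the noise added to a skeleton latent, once conditioned on the matching text prompt and once on a mismatched one.

```python
import numpy as np

def triplet_diffusion_loss(noise, pred_pos, pred_neg, margin=1.0):
    """Hinge-style triplet loss on denoising errors (illustrative assumption,
    not the paper's exact definition).

    noise:    true noise added to the skeleton latent, shape (batch, dim)
    pred_pos: noise predicted with the correct text prompt, same shape
    pred_neg: noise predicted with an incorrect text prompt, same shape
    """
    # Per-sample denoising error under the matching prompt.
    err_pos = np.mean((pred_pos - noise) ** 2, axis=-1)
    # Per-sample denoising error under the mismatched prompt.
    err_neg = np.mean((pred_neg - noise) ** 2, axis=-1)
    # Penalize samples whose matched error is not at least `margin`
    # smaller than their mismatched error.
    return float(np.mean(np.maximum(0.0, err_pos - err_neg + margin)))

# Toy usage with random arrays standing in for real denoiser outputs.
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 8))
pred_pos = noise + 0.1 * rng.standard_normal((4, 8))  # close to the true noise
pred_neg = noise + 2.0 * rng.standard_normal((4, 8))  # far from the true noise
print(triplet_diffusion_loss(noise, pred_pos, pred_neg))
```

Under this assumed form, the loss is zero once the denoiser reconstructs the noise sufficiently better with the correct prompt than with the incorrect one, which is one way to read "favoring correct matches while pushing apart incorrect ones".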
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Video | NTU RGB+D 120 | Accuracy (10 unseen classes) | 74.15 | TDSM |
| Video | NTU RGB+D 120 | Accuracy (24 unseen classes) | 65.06 | TDSM |
| Video | NTU RGB+D 120 | Random Split Accuracy | 69.47 | TDSM |
| Video | PKU-MMD | Random Split Accuracy | 70.76 | TDSM |
| Video | NTU RGB+D | Accuracy (12 unseen classes) | 56.03 | TDSM |
| Video | NTU RGB+D | Accuracy (5 unseen classes) | 86.49 | TDSM |
| Video | NTU RGB+D | Random Split Accuracy | 88.88 | TDSM |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (10 unseen classes) | 74.15 | TDSM |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (24 unseen classes) | 65.06 | TDSM |
| Temporal Action Localization | NTU RGB+D 120 | Random Split Accuracy | 69.47 | TDSM |
| Temporal Action Localization | PKU-MMD | Random Split Accuracy | 70.76 | TDSM |
| Temporal Action Localization | NTU RGB+D | Accuracy (12 unseen classes) | 56.03 | TDSM |
| Temporal Action Localization | NTU RGB+D | Accuracy (5 unseen classes) | 86.49 | TDSM |
| Temporal Action Localization | NTU RGB+D | Random Split Accuracy | 88.88 | TDSM |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (10 unseen classes) | 74.15 | TDSM |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (24 unseen classes) | 65.06 | TDSM |
| Zero-Shot Learning | NTU RGB+D 120 | Random Split Accuracy | 69.47 | TDSM |
| Zero-Shot Learning | PKU-MMD | Random Split Accuracy | 70.76 | TDSM |
| Zero-Shot Learning | NTU RGB+D | Accuracy (12 unseen classes) | 56.03 | TDSM |
| Zero-Shot Learning | NTU RGB+D | Accuracy (5 unseen classes) | 86.49 | TDSM |
| Zero-Shot Learning | NTU RGB+D | Random Split Accuracy | 88.88 | TDSM |
| Activity Recognition | NTU RGB+D 120 | Accuracy (10 unseen classes) | 74.15 | TDSM |
| Activity Recognition | NTU RGB+D 120 | Accuracy (24 unseen classes) | 65.06 | TDSM |
| Activity Recognition | NTU RGB+D 120 | Random Split Accuracy | 69.47 | TDSM |
| Activity Recognition | PKU-MMD | Random Split Accuracy | 70.76 | TDSM |
| Activity Recognition | NTU RGB+D | Accuracy (12 unseen classes) | 56.03 | TDSM |
| Activity Recognition | NTU RGB+D | Accuracy (5 unseen classes) | 86.49 | TDSM |
| Activity Recognition | NTU RGB+D | Random Split Accuracy | 88.88 | TDSM |
| Action Localization | NTU RGB+D 120 | Accuracy (10 unseen classes) | 74.15 | TDSM |
| Action Localization | NTU RGB+D 120 | Accuracy (24 unseen classes) | 65.06 | TDSM |
| Action Localization | NTU RGB+D 120 | Random Split Accuracy | 69.47 | TDSM |
| Action Localization | PKU-MMD | Random Split Accuracy | 70.76 | TDSM |
| Action Localization | NTU RGB+D | Accuracy (12 unseen classes) | 56.03 | TDSM |
| Action Localization | NTU RGB+D | Accuracy (5 unseen classes) | 86.49 | TDSM |
| Action Localization | NTU RGB+D | Random Split Accuracy | 88.88 | TDSM |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (10 unseen classes) | 74.15 | TDSM |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (24 unseen classes) | 65.06 | TDSM |
| 3D Action Recognition | NTU RGB+D 120 | Random Split Accuracy | 69.47 | TDSM |
| 3D Action Recognition | PKU-MMD | Random Split Accuracy | 70.76 | TDSM |
| 3D Action Recognition | NTU RGB+D | Accuracy (12 unseen classes) | 56.03 | TDSM |
| 3D Action Recognition | NTU RGB+D | Accuracy (5 unseen classes) | 86.49 | TDSM |
| 3D Action Recognition | NTU RGB+D | Random Split Accuracy | 88.88 | TDSM |
| Action Recognition | NTU RGB+D 120 | Accuracy (10 unseen classes) | 74.15 | TDSM |
| Action Recognition | NTU RGB+D 120 | Accuracy (24 unseen classes) | 65.06 | TDSM |
| Action Recognition | NTU RGB+D 120 | Random Split Accuracy | 69.47 | TDSM |
| Action Recognition | PKU-MMD | Random Split Accuracy | 70.76 | TDSM |
| Action Recognition | NTU RGB+D | Accuracy (12 unseen classes) | 56.03 | TDSM |
| Action Recognition | NTU RGB+D | Accuracy (5 unseen classes) | 86.49 | TDSM |
| Action Recognition | NTU RGB+D | Random Split Accuracy | 88.88 | TDSM |