Jeonghyeok Do, Munchurl Kim
Skeleton-based action recognition, which classifies human actions based on the coordinates of joints and their connectivity within skeleton data, is widely utilized in various scenarios. While Graph Convolutional Networks (GCNs) have been proposed for skeleton data represented as graphs, they suffer from limited receptive fields constrained by joint connectivity. To address this limitation, recent advancements have introduced transformer-based methods. However, capturing correlations between all joints in all frames requires substantial memory resources. To alleviate this, we propose a novel approach called Skeletal-Temporal Transformer (SkateFormer) that partitions joints and frames based on different types of skeletal-temporal relation (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. We categorize the key skeletal-temporal relations for action recognition into a total of four distinct types. These types combine (i) two skeletal relation types based on physically neighboring and distant joints, and (ii) two temporal relation types based on neighboring and distant frames. Through this partition-specific attention strategy, our SkateFormer can selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner with efficient computation. Extensive experiments on various benchmark datasets validate that our SkateFormer outperforms recent state-of-the-art methods.
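The partition-then-attend idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes, purely for illustration, that joints are ordered so that every consecutive group of `L` joints forms one body part, that frames split evenly into `n_seg` contiguous segments, and that attention is single-head with identity projections. The names `partition`, `self_attention`, `n_seg`, and `n_part` are hypothetical.

```python
import numpy as np


def partition(x, n_seg, n_part, frames, joints):
    """Group a (T, V, C) skeleton sequence into attention partitions.

    frames / joints are "neighbor" or "distant". Returns an array of shape
    (num_partitions, tokens_per_partition, C).
    """
    T, V, C = x.shape
    M, L = T // n_seg, V // n_part  # frames per segment, joints per body part
    # Axes: (segment n, offset m, body part k, joint-in-part l, channel).
    x = x.reshape(n_seg, M, n_part, L, C)
    # "neighbor": the partition is one segment / one body part.
    # "distant": the partition collects the same offset from every
    # segment / the same joint slot from every body part.
    f_out, f_in = (0, 1) if frames == "neighbor" else (1, 0)
    j_out, j_in = (2, 3) if joints == "neighbor" else (3, 2)
    x = x.transpose(f_out, j_out, f_in, j_in, 4)
    return x.reshape(x.shape[0] * x.shape[1], x.shape[2] * x.shape[3], C)


def self_attention(tokens):
    """Single-head attention inside each partition (identity projections)."""
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


T, V, C = 8, 6, 4  # toy sizes, not the paper's settings
x = np.random.default_rng(0).normal(size=(T, V, C))
full_pairs = (T * V) ** 2  # pairwise scores for attention over all tokens
for joints in ("neighbor", "distant"):
    for frames in ("neighbor", "distant"):
        parts = partition(x, n_seg=2, n_part=3, frames=frames, joints=joints)
        out = self_attention(parts)
        pairs = parts.shape[0] * parts.shape[1] ** 2
        print(joints, frames, parts.shape, pairs, "vs", full_pairs)
```

With these toy sizes, each of the four relation types computes between 192 and 576 pairwise attention scores, compared with 2304 for attention over all joints in all frames, which is the memory saving the abstract refers to.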
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Human Interaction Recognition | NTU RGB+D | Accuracy (Cross-Subject) | 97.1 | SkateFormer |
| Human Interaction Recognition | NTU RGB+D | Accuracy (Cross-View) | 99.3 | SkateFormer |
| Human Interaction Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 93.2 | SkateFormer |
| Human Interaction Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 92.3 | SkateFormer |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 91.4 | SkateFormer |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.8 | SkateFormer |
| Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | SkateFormer |
| Action Recognition | N-UCLA | Accuracy | 98.3 | SkateFormer |
| Action Recognition | NTU RGB+D | Accuracy (Cross-Subject) | 93.5 | SkateFormer |
| Action Recognition | NTU RGB+D | Accuracy (Cross-View) | 97.8 | SkateFormer |
| Action Recognition | NTU RGB+D | Ensembled Modalities | 4 | SkateFormer |