Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, Wanli Ouyang
Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. By coupling these proposals, we develop a powerful feature extractor named MS-G3D, based on which our model outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.
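To make the idea of disentangled multi-scale aggregation more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "disentangling" means each scale k only links joints that are exactly k hops apart (plus self-loops), so that distant joints are not drowned out by nearby ones. The function names `hop_distances` and `k_hop_adjacency` and the 5-joint chain skeleton are hypothetical and chosen purely for illustration.

```python
from collections import deque


def hop_distances(num_nodes, edges):
    """All-pairs shortest-path hop counts on an undirected graph, via BFS."""
    neighbors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    dist = [[None] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[src][v] is None:  # first visit gives the shortest hop count
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def k_hop_adjacency(dist, k):
    """Binary mask with 1 where two nodes are exactly k hops apart, plus self-loops."""
    n = len(dist)
    return [
        [1 if (i == j or dist[i][j] == k) else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical skeleton: five joints connected in a chain 0-1-2-3-4.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
dist = hop_distances(5, edges)

# One adjacency mask per scale; each mask would feed its own graph convolution.
scales = [k_hop_adjacency(dist, k) for k in range(1, 4)]
```

In this sketch, the mask for scale k contains no edges to joints closer than k hops, in contrast to using powers of the adjacency matrix, where walks that loop back onto nearby joints can give local neighborhoods disproportionate weight. How the masks are normalized and combined with learned weights is left out here.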
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Assembly101 | Actions Top-1 | 28.7 | MS-G3D |
| Video | Assembly101 | Object Top-1 | 36.3 | MS-G3D |
| Video | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D |
| Video | Kinetics-Skeleton dataset | Accuracy | 38 | MS-G3D |
| Video | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net |
| Video | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net |
| Temporal Action Localization | Assembly101 | Actions Top-1 | 28.7 | MS-G3D |
| Temporal Action Localization | Assembly101 | Object Top-1 | 36.3 | MS-G3D |
| Temporal Action Localization | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D |
| Temporal Action Localization | Kinetics-Skeleton dataset | Accuracy | 38 | MS-G3D |
| Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net |
| Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net |
| Zero-Shot Learning | Assembly101 | Actions Top-1 | 28.7 | MS-G3D |
| Zero-Shot Learning | Assembly101 | Object Top-1 | 36.3 | MS-G3D |
| Zero-Shot Learning | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D |
| Zero-Shot Learning | Kinetics-Skeleton dataset | Accuracy | 38 | MS-G3D |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net |
| Activity Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 50.83 | MS-G3D |
| Activity Recognition | Assembly101 | Actions Top-1 | 28.7 | MS-G3D |
| Activity Recognition | Assembly101 | Object Top-1 | 36.3 | MS-G3D |
| Activity Recognition | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D |
| Activity Recognition | Kinetics-Skeleton dataset | Accuracy | 38 | MS-G3D |
| Activity Recognition | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net |
| Activity Recognition | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net |
| Action Localization | Assembly101 | Actions Top-1 | 28.7 | MS-G3D |
| Action Localization | Assembly101 | Object Top-1 | 36.3 | MS-G3D |
| Action Localization | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D |
| Action Localization | Kinetics-Skeleton dataset | Accuracy | 38 | MS-G3D |
| Action Localization | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net |
| Action Localization | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net |
| Action Detection | Kinetics-Skeleton dataset | Accuracy | 38 | MS-G3D |
| Action Detection | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net |
| Action Detection | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net |
| 3D Action Recognition | Assembly101 | Actions Top-1 | 28.7 | MS-G3D |
| 3D Action Recognition | Assembly101 | Object Top-1 | 36.3 | MS-G3D |
| 3D Action Recognition | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D |
| 3D Action Recognition | Kinetics-Skeleton dataset | Accuracy | 38 | MS-G3D |
| 3D Action Recognition | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net |
| 3D Action Recognition | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net |
| Action Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 50.83 | MS-G3D |
| Action Recognition | Assembly101 | Actions Top-1 | 28.7 | MS-G3D |
| Action Recognition | Assembly101 | Object Top-1 | 36.3 | MS-G3D |
| Action Recognition | Assembly101 | Verbs Top-1 | 65.7 | MS-G3D |
| Action Recognition | Kinetics-Skeleton dataset | Accuracy | 38 | MS-G3D |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 91.5 | MS-G3D Net |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 96.2 | MS-G3D Net |