Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition

Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, Wanli Ouyang

2020-03-31 | CVPR 2020 | 3D Action Recognition, Skeleton Based Action Recognition, Long-range modeling, Action Recognition

Abstract

Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. By coupling these proposals, we develop a powerful feature extractor named MS-G3D, with which our model outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.
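
The idea of disentangling neighborhoods can be illustrated with a small sketch. It assumes that "disentangled" means scale k keeps only the joints exactly k hops away (plus a self-loop), and it contrasts this with raising the adjacency matrix to the k-th power, which counts walks and so weights nearby joints more heavily. The function names and the 5-joint chain are hypothetical and are not taken from the authors' code.

```python
from collections import deque


def hop_distances(edges, num_nodes):
    """All-pairs shortest hop distances, computed by BFS from every node."""
    neighbors = [[] for _ in range(num_nodes)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    dist = [[None] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[src][v] is None:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def disentangled_adjacency(edges, num_nodes, k):
    """Entry (i, j) is 1 if j is exactly k hops from i, or if i == j."""
    dist = hop_distances(edges, num_nodes)
    return [
        [1 if (i == j or dist[i][j] == k) else 0 for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


def powered_adjacency(edges, num_nodes, k):
    """Entry (i, j) of (A + I)^k: the number of length-k walks with self-loops."""
    base = [[1 if i == j else 0 for j in range(num_nodes)] for i in range(num_nodes)]
    for i, j in edges:
        base[i][j] = base[j][i] = 1
    result = [[1 if i == j else 0 for j in range(num_nodes)] for i in range(num_nodes)]
    for _ in range(k):
        result = [
            [sum(result[i][m] * base[m][j] for m in range(num_nodes))
             for j in range(num_nodes)]
            for i in range(num_nodes)
        ]
    return result


# Hypothetical 5-joint chain: 0 - 1 - 2 - 3 - 4
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(powered_adjacency(chain, 5, 2)[0])       # [2, 2, 1, 0, 0]
print(disentangled_adjacency(chain, 5, 2)[0])  # [1, 0, 1, 0, 0]
```

In the powered matrix, joint 0 gives its 2-hop neighbor a weight of 1 but itself and its 1-hop neighbor a weight of 2, so the farther joint is under-weighted. The disentangled matrix gives the 2-hop neighbor the same weight as the self-loop and leaves closer joints to the lower scales.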

Results

Task	Dataset	Metric	Value	Model
Video	Assembly101	Actions Top-1	28.7	MS-G3D
Video	Assembly101	Object Top-1	36.3	MS-G3D
Video	Assembly101	Verbs Top-1	65.7	MS-G3D
Video	Kinetics-Skeleton dataset	Accuracy	38	MS-G3D
Video	NTU RGB+D	Accuracy (CS)	91.5	MS-G3D Net
Video	NTU RGB+D	Accuracy (CV)	96.2	MS-G3D Net
Temporal Action Localization	Assembly101	Actions Top-1	28.7	MS-G3D
Temporal Action Localization	Assembly101	Object Top-1	36.3	MS-G3D
Temporal Action Localization	Assembly101	Verbs Top-1	65.7	MS-G3D
Temporal Action Localization	Kinetics-Skeleton dataset	Accuracy	38	MS-G3D
Temporal Action Localization	NTU RGB+D	Accuracy (CS)	91.5	MS-G3D Net
Temporal Action Localization	NTU RGB+D	Accuracy (CV)	96.2	MS-G3D Net
Zero-Shot Learning	Assembly101	Actions Top-1	28.7	MS-G3D
Zero-Shot Learning	Assembly101	Object Top-1	36.3	MS-G3D
Zero-Shot Learning	Assembly101	Verbs Top-1	65.7	MS-G3D
Zero-Shot Learning	Kinetics-Skeleton dataset	Accuracy	38	MS-G3D
Zero-Shot Learning	NTU RGB+D	Accuracy (CS)	91.5	MS-G3D Net
Zero-Shot Learning	NTU RGB+D	Accuracy (CV)	96.2	MS-G3D Net
Activity Recognition	H2O (2 Hands and Objects)	Actions Top-1	50.83	MS-G3D
Activity Recognition	Assembly101	Actions Top-1	28.7	MS-G3D
Activity Recognition	Assembly101	Object Top-1	36.3	MS-G3D
Activity Recognition	Assembly101	Verbs Top-1	65.7	MS-G3D
Activity Recognition	Kinetics-Skeleton dataset	Accuracy	38	MS-G3D
Activity Recognition	NTU RGB+D	Accuracy (CS)	91.5	MS-G3D Net
Activity Recognition	NTU RGB+D	Accuracy (CV)	96.2	MS-G3D Net
Action Localization	Assembly101	Actions Top-1	28.7	MS-G3D
Action Localization	Assembly101	Object Top-1	36.3	MS-G3D
Action Localization	Assembly101	Verbs Top-1	65.7	MS-G3D
Action Localization	Kinetics-Skeleton dataset	Accuracy	38	MS-G3D
Action Localization	NTU RGB+D	Accuracy (CS)	91.5	MS-G3D Net
Action Localization	NTU RGB+D	Accuracy (CV)	96.2	MS-G3D Net
Action Detection	Kinetics-Skeleton dataset	Accuracy	38	MS-G3D
Action Detection	NTU RGB+D	Accuracy (CS)	91.5	MS-G3D Net
Action Detection	NTU RGB+D	Accuracy (CV)	96.2	MS-G3D Net
3D Action Recognition	Assembly101	Actions Top-1	28.7	MS-G3D
3D Action Recognition	Assembly101	Object Top-1	36.3	MS-G3D
3D Action Recognition	Assembly101	Verbs Top-1	65.7	MS-G3D
3D Action Recognition	Kinetics-Skeleton dataset	Accuracy	38	MS-G3D
3D Action Recognition	NTU RGB+D	Accuracy (CS)	91.5	MS-G3D Net
3D Action Recognition	NTU RGB+D	Accuracy (CV)	96.2	MS-G3D Net
Action Recognition	H2O (2 Hands and Objects)	Actions Top-1	50.83	MS-G3D
Action Recognition	Assembly101	Actions Top-1	28.7	MS-G3D
Action Recognition	Assembly101	Object Top-1	36.3	MS-G3D
Action Recognition	Assembly101	Verbs Top-1	65.7	MS-G3D
Action Recognition	Kinetics-Skeleton dataset	Accuracy	38	MS-G3D
Action Recognition	NTU RGB+D	Accuracy (CS)	91.5	MS-G3D Net
Action Recognition	NTU RGB+D	Accuracy (CV)	96.2	MS-G3D Net
