Yuya Obinata, Takuma Yamamoto
We present a module that extends the temporal graph of a graph convolutional network (GCN) for action recognition from skeleton sequences. Existing methods try to build a more appropriate spatial graph within each frame (intra-frame) but do not optimize the temporal graph between frames (inter-frame). Concretely, they connect only vertices that correspond to the same joint across frames. In this work, we add inter-frame connections to multiple neighboring vertices and extract additional features based on the extended temporal graph. Our module is a simple yet effective way to extract correlated features of multiple joints in human movement. It also further improves performance when combined with other GCN methods that optimize only the spatial graph. We conduct extensive experiments on two large datasets, NTU RGB+D and Kinetics-Skeleton, and demonstrate that our module is effective for several existing models and that our final model achieves state-of-the-art performance.
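The difference between the conventional temporal graph and an extended one can be illustrated with a small sketch. The abstract does not specify which neighboring vertices are connected or how features are extracted, so the code below assumes, purely for illustration, that each joint is also linked to its 1-hop spatial neighbors in the next frame. The function name `temporal_edges` and the toy three-joint skeleton are hypothetical and not taken from the paper.

```python
def temporal_edges(num_frames, num_joints, bones, extended=True):
    """Return inter-frame edges as ((t, v), (t + 1, u)) pairs."""
    # Spatial 1-hop neighbours of each joint, from the undirected bone list.
    neighbours = {v: set() for v in range(num_joints)}
    for i, j in bones:
        neighbours[i].add(j)
        neighbours[j].add(i)

    edges = set()
    for t in range(num_frames - 1):
        for v in range(num_joints):
            # Conventional temporal graph: same joint in consecutive frames.
            edges.add(((t, v), (t + 1, v)))
            if extended:
                # ASSUMPTION: the extension links each joint to its
                # 1-hop spatial neighbours in the next frame.
                for u in neighbours[v]:
                    edges.add(((t, v), (t + 1, u)))
    return edges


# Toy 3-joint chain (0-1-2) over 3 frames; not a real dataset skeleton.
bones = [(0, 1), (1, 2)]
baseline = temporal_edges(3, 3, bones, extended=False)
extended = temporal_edges(3, 3, bones, extended=True)
print(len(baseline), len(extended))  # prints: 6 14
```

In this toy case the conventional graph has 6 inter-frame edges (3 joints across 2 frame transitions), while the assumed extension adds one edge per joint-neighbor pair per transition, giving 14. A GCN layer built on the larger edge set can aggregate features from several joints across frames, which is the kind of correlated multi-joint feature the abstract describes.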
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Video | Kinetics-Skeleton dataset | Accuracy | 38.6 | 2s-AGCN+TEM |
| Video | NTU RGB+D | Accuracy (CS) | 91 | MS-AAGCN+TEM |
| Video | NTU RGB+D | Accuracy (CV) | 96.5 | MS-AAGCN+TEM |
| Temporal Action Localization | Kinetics-Skeleton dataset | Accuracy | 38.6 | 2s-AGCN+TEM |
| Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 91 | MS-AAGCN+TEM |
| Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 96.5 | MS-AAGCN+TEM |
| Zero-Shot Learning | Kinetics-Skeleton dataset | Accuracy | 38.6 | 2s-AGCN+TEM |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 91 | MS-AAGCN+TEM |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 96.5 | MS-AAGCN+TEM |
| Activity Recognition | Kinetics-Skeleton dataset | Accuracy | 38.6 | 2s-AGCN+TEM |
| Activity Recognition | NTU RGB+D | Accuracy (CS) | 91 | MS-AAGCN+TEM |
| Activity Recognition | NTU RGB+D | Accuracy (CV) | 96.5 | MS-AAGCN+TEM |
| Action Localization | Kinetics-Skeleton dataset | Accuracy | 38.6 | 2s-AGCN+TEM |
| Action Localization | NTU RGB+D | Accuracy (CS) | 91 | MS-AAGCN+TEM |
| Action Localization | NTU RGB+D | Accuracy (CV) | 96.5 | MS-AAGCN+TEM |
| Action Detection | Kinetics-Skeleton dataset | Accuracy | 38.6 | 2s-AGCN+TEM |
| Action Detection | NTU RGB+D | Accuracy (CS) | 91 | MS-AAGCN+TEM |
| Action Detection | NTU RGB+D | Accuracy (CV) | 96.5 | MS-AAGCN+TEM |
| 3D Action Recognition | Kinetics-Skeleton dataset | Accuracy | 38.6 | 2s-AGCN+TEM |
| 3D Action Recognition | NTU RGB+D | Accuracy (CS) | 91 | MS-AAGCN+TEM |
| 3D Action Recognition | NTU RGB+D | Accuracy (CV) | 96.5 | MS-AAGCN+TEM |
| Action Recognition | Kinetics-Skeleton dataset | Accuracy | 38.6 | 2s-AGCN+TEM |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 91 | MS-AAGCN+TEM |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 96.5 | MS-AAGCN+TEM |