Chao Li, Qiaoyong Zhong, Di Xie, Shiliang Pu
Skeleton-based human action recognition has recently drawn increasing attention with the availability of large-scale skeleton datasets. The most crucial factors for this task lie in two aspects: the intra-frame representation of joint co-occurrences and the inter-frame representation of the skeletons' temporal evolution. In this paper we propose an end-to-end convolutional co-occurrence feature learning framework. The co-occurrence features are learned with a hierarchical methodology in which different levels of contextual information are aggregated gradually. First, point-level information of each joint is encoded independently. The encoded features are then assembled into semantic representations in both the spatial and the temporal domain. Specifically, we introduce a global spatial aggregation scheme, which learns superior joint co-occurrence features compared with local aggregation. In addition, raw skeleton coordinates and their temporal differences are integrated with a two-stream paradigm. Experiments show that our approach consistently outperforms other state-of-the-art methods on action recognition and detection benchmarks such as NTU RGB+D, SBU Kinect Interaction and PKU-MMD.
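As a rough illustration of the pipeline described in the abstract, the PyTorch sketch below encodes each joint independently with point-level convolutions, then moves the joint axis into the channel dimension so that later convolutions aggregate all joints globally (the co-occurrence step), and fuses a raw-coordinate stream with a temporal-difference stream. The layer widths, kernel sizes, joint count and input layout are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of hierarchical co-occurrence feature learning with a
# two-stream (pose + motion) design. Hyperparameters are assumptions.
import torch
import torch.nn as nn


class CoOccurrenceBranch(nn.Module):
    """Encodes one input stream (raw coordinates or their temporal difference).

    Stage 1 learns point-level features for each joint independently (1x1 conv
    over the coordinate channels). Stage 2 moves the joint axis into the channel
    dimension, so subsequent convolutions mix information across all joints,
    i.e. global spatial aggregation rather than local aggregation.
    """

    def __init__(self, in_channels=3, num_joints=25):
        super().__init__()
        # Point-level encoding: each joint is processed independently.
        self.point = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=(3, 1), padding=(1, 0)),
            nn.ReLU(inplace=True),
        )
        # Co-occurrence encoding: joints now act as input channels, so every
        # output channel aggregates information from all joints globally.
        self.cooccurrence = nn.Sequential(
            nn.Conv2d(num_joints, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        # x: (batch, coords, frames, joints)
        x = self.point(x)             # (batch, feat, frames, joints)
        x = x.permute(0, 3, 2, 1)     # (batch, joints, frames, feat)
        return self.cooccurrence(x)


class TwoStreamSketch(nn.Module):
    """Fuses raw-coordinate and temporal-difference streams before classification."""

    def __init__(self, in_channels=3, num_joints=25, num_classes=60):
        super().__init__()
        self.pose_branch = CoOccurrenceBranch(in_channels, num_joints)
        self.motion_branch = CoOccurrenceBranch(in_channels, num_joints)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        # x: (batch, coords, frames, joints), e.g. (N, 3, 64, 25) for NTU RGB+D
        motion = torch.zeros_like(x)
        motion[:, :, 1:] = x[:, :, 1:] - x[:, :, :-1]   # temporal difference
        feat = torch.cat([self.pose_branch(x), self.motion_branch(motion)], dim=1)
        return self.classifier(feat)


if __name__ == "__main__":
    model = TwoStreamSketch()
    clip = torch.randn(2, 3, 64, 25)   # two clips, 64 frames, 25 joints
    print(model(clip).shape)           # torch.Size([2, 60])
```

The permutation between the two stages is the key design choice: once joints occupy the channel axis, every learned filter can weight all joints jointly, which is what allows the network to capture global co-occurrence patterns.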
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D | Accuracy (CS) | 86.5 | HCN |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 91.1 | HCN |
| Action Detection | PKU-MMD | mAP@0.50 (CS) | 92.6 | HCN |
| Action Detection | PKU-MMD | mAP@0.50 (CV) | 94.2 | HCN |
| Pose Estimation | RF-MMD | mAP@0.1 (Through-wall) | 78.5 | HCN |
| Pose Estimation | RF-MMD | mAP@0.1 (Visible) | 82.5 | HCN |