Ailing Zeng, Xiao Sun, Lei Yang, Nanxuan Zhao, Minhao Liu, Qiang Xu

Various deep learning techniques have been proposed to solve the single-view 2D-to-3D pose estimation problem. While the average prediction accuracy has improved significantly over the years, the performance on hard poses with depth ambiguity, self-occlusion, and complex or rare poses is still far from satisfactory. In this work, we target these hard poses and present a novel skeletal GNN learning solution. To be specific, we propose a hop-aware hierarchical channel-squeezing fusion layer to effectively extract relevant information from neighboring nodes while suppressing undesired noise in GNN learning. In addition, we propose a temporal-aware dynamic graph construction procedure that is robust and effective for 3D pose estimation. Experimental results on the Human3.6M dataset show that our solution improves average prediction accuracy by 10.3% and greatly improves on hard poses over state-of-the-art techniques. We further apply the proposed technique to the skeleton-based action recognition task and also achieve state-of-the-art performance. Our code is available at https://github.com/ailingzengzzz/Skeletal-GNN.

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| 3D Human Pose Estimation | MPI-INF-3DHP | AUC | 46.2 | Skeletal GNN |
| 3D Human Pose Estimation | MPI-INF-3DHP | PCK | 82.1 | Skeletal GNN |
| 3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 47.9 | Skeletal GNN |
| Video | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.2 | Skeletal GNN |
| Video | NTU RGB+D 120 | Accuracy (Cross-Subject) | 87.5 | Skeletal GNN |
| Video | NTU RGB+D 120 | Ensembled Modalities | 4 | Skeletal GNN |
| Video | NTU RGB+D | Accuracy (CS) | 91.6 | Skeletal GNN |
| Video | NTU RGB+D | Accuracy (CV) | 96.7 | Skeletal GNN |
| Video | NTU RGB+D | Ensembled Modalities | 4 | Skeletal GNN |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.2 | Skeletal GNN |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 87.5 | Skeletal GNN |
| Temporal Action Localization | NTU RGB+D 120 | Ensembled Modalities | 4 | Skeletal GNN |
| Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 91.6 | Skeletal GNN |
| Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 96.7 | Skeletal GNN |
| Temporal Action Localization | NTU RGB+D | Ensembled Modalities | 4 | Skeletal GNN |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.2 | Skeletal GNN |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Subject) | 87.5 | Skeletal GNN |
| Zero-Shot Learning | NTU RGB+D 120 | Ensembled Modalities | 4 | Skeletal GNN |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 91.6 | Skeletal GNN |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 96.7 | Skeletal GNN |
| Zero-Shot Learning | NTU RGB+D | Ensembled Modalities | 4 | Skeletal GNN |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.2 | Skeletal GNN |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 87.5 | Skeletal GNN |
| Activity Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | Skeletal GNN |
| Activity Recognition | NTU RGB+D | Accuracy (CS) | 91.6 | Skeletal GNN |
| Activity Recognition | NTU RGB+D | Accuracy (CV) | 96.7 | Skeletal GNN |
| Activity Recognition | NTU RGB+D | Ensembled Modalities | 4 | Skeletal GNN |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.2 | Skeletal GNN |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 87.5 | Skeletal GNN |
| Action Localization | NTU RGB+D 120 | Ensembled Modalities | 4 | Skeletal GNN |
| Action Localization | NTU RGB+D | Accuracy (CS) | 91.6 | Skeletal GNN |
| Action Localization | NTU RGB+D | Accuracy (CV) | 96.7 | Skeletal GNN |
| Action Localization | NTU RGB+D | Ensembled Modalities | 4 | Skeletal GNN |
| Pose Estimation | MPI-INF-3DHP | AUC | 46.2 | Skeletal GNN |
| Pose Estimation | MPI-INF-3DHP | PCK | 82.1 | Skeletal GNN |
| Pose Estimation | Human3.6M | Average MPJPE (mm) | 47.9 | Skeletal GNN |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.2 | Skeletal GNN |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Subject) | 87.5 | Skeletal GNN |
| Action Detection | NTU RGB+D 120 | Ensembled Modalities | 4 | Skeletal GNN |
| Action Detection | NTU RGB+D | Accuracy (CS) | 91.6 | Skeletal GNN |
| Action Detection | NTU RGB+D | Accuracy (CV) | 96.7 | Skeletal GNN |
| Action Detection | NTU RGB+D | Ensembled Modalities | 4 | Skeletal GNN |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.2 | Skeletal GNN |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 87.5 | Skeletal GNN |
| 3D Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | Skeletal GNN |
| 3D Action Recognition | NTU RGB+D | Accuracy (CS) | 91.6 | Skeletal GNN |
| 3D Action Recognition | NTU RGB+D | Accuracy (CV) | 96.7 | Skeletal GNN |
| 3D Action Recognition | NTU RGB+D | Ensembled Modalities | 4 | Skeletal GNN |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 89.2 | Skeletal GNN |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 87.5 | Skeletal GNN |
| Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | Skeletal GNN |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 91.6 | Skeletal GNN |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 96.7 | Skeletal GNN |
| Action Recognition | NTU RGB+D | Ensembled Modalities | 4 | Skeletal GNN |
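
The pose estimation rows above report MPJPE (mean per-joint position error, in mm) and PCK (percentage of correct keypoints). As a rough illustration of how these two metrics are commonly computed, here is a minimal sketch. It is not taken from the authors' repository: the function names, the toy joint coordinates, and the 150 mm PCK threshold are assumptions made for this example.

```python
import math

def joint_errors(pred, gt):
    """Euclidean distance between each predicted joint and its ground truth."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]

def mpjpe(pred, gt):
    """Mean per-joint position error, in the same unit as the inputs."""
    errs = joint_errors(pred, gt)
    return sum(errs) / len(errs)

def pck(pred, gt, threshold=150.0):
    """Percentage of joints whose error is within `threshold` (assumed 150 mm)."""
    errs = joint_errors(pred, gt)
    return 100.0 * sum(e <= threshold for e in errs) / len(errs)

# Toy 3-joint skeleton in millimetres (hypothetical values).
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
pred = [(0.0, 0.0, 30.0), (100.0, 40.0, 0.0), (0.0, 200.0, 200.0)]

print(mpjpe(pred, gt))  # per-joint errors are 30, 40 and 200 mm, so the mean is 90.0
print(pck(pred, gt))    # 2 of 3 joints fall within 150 mm
```

Lower is better for MPJPE and higher is better for PCK. Published evaluation protocols may additionally align the root joint or apply a rigid alignment before measuring the error, which this sketch does not do.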