Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, Bo Dai
The human skeleton, as a compact representation of human action, has received increasing attention in recent years. Many skeleton-based action recognition methods adopt graph convolutional networks (GCN) to extract features on top of human skeletons. Despite the positive results shown in previous works, GCN-based methods are subject to limitations in robustness, interoperability, and scalability. In this work, we propose PoseC3D, a new approach to skeleton-based action recognition, which relies on a 3D heatmap stack instead of a graph sequence as the base representation of human skeletons. Compared to GCN-based methods, PoseC3D is more effective in learning spatiotemporal features, is more robust against pose estimation noise, and generalizes better in cross-dataset settings. PoseC3D can also handle multi-person scenarios without additional computation cost, and its features can be easily integrated with other modalities at early fusion stages, which opens a large design space for further boosting performance. On four challenging datasets, PoseC3D consistently obtains superior performance, both when used alone on skeletons and when combined with the RGB modality.
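
The 3D heatmap stack mentioned in the abstract can be illustrated with a minimal sketch. It assumes each joint is rendered as a Gaussian centred on its 2D coordinate and scaled by its confidence score, that people in the same frame are merged by an element-wise maximum, and that the per-frame maps are stacked along time into a K x T x H x W volume. The function names, the input layout, and the `sigma` default are illustrative assumptions, not the authors' released code, and plain Python lists stand in for the array library a real pipeline would use.

```python
import math


def joint_heatmap(x, y, conf, height, width, sigma=0.6):
    """Return an H x W map: a Gaussian centred at (x, y), scaled by conf."""
    return [
        [conf * math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2 * sigma ** 2))
         for col in range(width)]
        for row in range(height)
    ]


def heatmap_stack(frames, num_joints, height, width, sigma=0.6):
    """Build a K x T x H x W volume from per-frame 2D poses.

    frames[t] is a list of persons; each person is a list of num_joints
    (x, y, confidence) triples in pixel coordinates (assumed layout).
    """
    volume = [[[[0.0] * width for _ in range(height)] for _ in frames]
              for _ in range(num_joints)]
    for t, persons in enumerate(frames):
        for person in persons:
            for k, (x, y, conf) in enumerate(person):
                hm = joint_heatmap(x, y, conf, height, width, sigma)
                plane = volume[k][t]
                for row in range(height):
                    for col in range(width):
                        # Element-wise max merges every person into one map,
                        # so the volume size does not depend on person count.
                        if hm[row][col] > plane[row][col]:
                            plane[row][col] = hm[row][col]
    return volume


# Toy input: 2 joints, 2 frames; the second frame contains two people.
frames = [
    [[(1, 1, 1.0), (2, 3, 0.5)]],
    [[(1, 1, 1.0), (2, 3, 0.5)], [(3, 0, 0.8), (0, 2, 0.9)]],
]
volume = heatmap_stack(frames, num_joints=2, height=4, width=4)
shape = (len(volume), len(volume[0]), len(volume[0][0]), len(volume[0][0][0]))
print(shape)  # (2, 2, 4, 4)
```

In this sketch the second frame holds two people yet yields maps of the same size as the first, which is one way to read the abstract's claim that multi-person scenes add no cost on the network side. A volume of this shape can be fed to a 3D convolutional network in the same way as a video clip with K channels.
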
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Assembly101 | Actions Top-1 | 33.61 | RGBPoseConv3D |
| Video | Assembly101 | Object Top-1 | 42.9 | RGBPoseConv3D |
| Video | Assembly101 | Verbs Top-1 | 61.99 | RGBPoseConv3D |
| Video | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.3 | PoseC3D (w. HRNet 2D Skeleton) |
| Video | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.9 | PoseC3D (w. HRNet 2D Skeleton) |
| Video | Kinetics-Skeleton dataset | Accuracy | 49.1 | PoseC3D (SlowOnly-346) |
| Video | Kinetics-Skeleton dataset | Accuracy | 47.7 | PoseC3D |
| Video | NTU RGB+D | Accuracy (CS) | 94.1 | PoseC3D [3D Heatmap] |
| Video | NTU RGB+D | Accuracy (CV) | 97.1 | PoseC3D [3D Heatmap] |
| Video | NTU RGB+D | Ensembled Modalities | 2 | PoseC3D [3D Heatmap] |
| Temporal Action Localization | Assembly101 | Actions Top-1 | 33.61 | RGBPoseConv3D |
| Temporal Action Localization | Assembly101 | Object Top-1 | 42.9 | RGBPoseConv3D |
| Temporal Action Localization | Assembly101 | Verbs Top-1 | 61.99 | RGBPoseConv3D |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.3 | PoseC3D (w. HRNet 2D Skeleton) |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.9 | PoseC3D (w. HRNet 2D Skeleton) |
| Temporal Action Localization | Kinetics-Skeleton dataset | Accuracy | 49.1 | PoseC3D (SlowOnly-346) |
| Temporal Action Localization | Kinetics-Skeleton dataset | Accuracy | 47.7 | PoseC3D |
| Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 94.1 | PoseC3D [3D Heatmap] |
| Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 97.1 | PoseC3D [3D Heatmap] |
| Temporal Action Localization | NTU RGB+D | Ensembled Modalities | 2 | PoseC3D [3D Heatmap] |
| Zero-Shot Learning | Assembly101 | Actions Top-1 | 33.61 | RGBPoseConv3D |
| Zero-Shot Learning | Assembly101 | Object Top-1 | 42.9 | RGBPoseConv3D |
| Zero-Shot Learning | Assembly101 | Verbs Top-1 | 61.99 | RGBPoseConv3D |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.3 | PoseC3D (w. HRNet 2D Skeleton) |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.9 | PoseC3D (w. HRNet 2D Skeleton) |
| Zero-Shot Learning | Kinetics-Skeleton dataset | Accuracy | 49.1 | PoseC3D (SlowOnly-346) |
| Zero-Shot Learning | Kinetics-Skeleton dataset | Accuracy | 47.7 | PoseC3D |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 94.1 | PoseC3D [3D Heatmap] |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 97.1 | PoseC3D [3D Heatmap] |
| Zero-Shot Learning | NTU RGB+D | Ensembled Modalities | 2 | PoseC3D [3D Heatmap] |
| Activity Recognition | Volleyball | Accuracy | 91.3 | PoseC3D (Pose Only) |
| Activity Recognition | NTU RGB+D | Accuracy (CS) | 97 | PoseC3D (RGB + Pose) |
| Activity Recognition | NTU RGB+D | Accuracy (CV) | 99.6 | PoseC3D (RGB + Pose) |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 96.4 | PoseC3D (RGB + Pose) |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 95.3 | PoseC3D (RGB + Pose) |
| Activity Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 83.47 | RGBPoseConv3D |
| Activity Recognition | Assembly101 | Actions Top-1 | 33.61 | RGBPoseConv3D |
| Activity Recognition | Assembly101 | Object Top-1 | 42.9 | RGBPoseConv3D |
| Activity Recognition | Assembly101 | Verbs Top-1 | 61.99 | RGBPoseConv3D |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.3 | PoseC3D (w. HRNet 2D Skeleton) |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.9 | PoseC3D (w. HRNet 2D Skeleton) |
| Activity Recognition | Kinetics-Skeleton dataset | Accuracy | 49.1 | PoseC3D (SlowOnly-346) |
| Activity Recognition | Kinetics-Skeleton dataset | Accuracy | 47.7 | PoseC3D |
| Activity Recognition | NTU RGB+D | Accuracy (CS) | 94.1 | PoseC3D [3D Heatmap] |
| Activity Recognition | NTU RGB+D | Accuracy (CV) | 97.1 | PoseC3D [3D Heatmap] |
| Activity Recognition | NTU RGB+D | Ensembled Modalities | 2 | PoseC3D [3D Heatmap] |
| Action Localization | Assembly101 | Actions Top-1 | 33.61 | RGBPoseConv3D |
| Action Localization | Assembly101 | Object Top-1 | 42.9 | RGBPoseConv3D |
| Action Localization | Assembly101 | Verbs Top-1 | 61.99 | RGBPoseConv3D |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.3 | PoseC3D (w. HRNet 2D Skeleton) |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.9 | PoseC3D (w. HRNet 2D Skeleton) |
| Action Localization | Kinetics-Skeleton dataset | Accuracy | 49.1 | PoseC3D (SlowOnly-346) |
| Action Localization | Kinetics-Skeleton dataset | Accuracy | 47.7 | PoseC3D |
| Action Localization | NTU RGB+D | Accuracy (CS) | 94.1 | PoseC3D [3D Heatmap] |
| Action Localization | NTU RGB+D | Accuracy (CV) | 97.1 | PoseC3D [3D Heatmap] |
| Action Localization | NTU RGB+D | Ensembled Modalities | 2 | PoseC3D [3D Heatmap] |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.3 | PoseC3D (w. HRNet 2D Skeleton) |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.9 | PoseC3D (w. HRNet 2D Skeleton) |
| Action Detection | Kinetics-Skeleton dataset | Accuracy | 49.1 | PoseC3D (SlowOnly-346) |
| Action Detection | Kinetics-Skeleton dataset | Accuracy | 47.7 | PoseC3D |
| Action Detection | NTU RGB+D | Accuracy (CS) | 94.1 | PoseC3D [3D Heatmap] |
| Action Detection | NTU RGB+D | Accuracy (CV) | 97.1 | PoseC3D [3D Heatmap] |
| Action Detection | NTU RGB+D | Ensembled Modalities | 2 | PoseC3D [3D Heatmap] |
| 3D Action Recognition | Assembly101 | Actions Top-1 | 33.61 | RGBPoseConv3D |
| 3D Action Recognition | Assembly101 | Object Top-1 | 42.9 | RGBPoseConv3D |
| 3D Action Recognition | Assembly101 | Verbs Top-1 | 61.99 | RGBPoseConv3D |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.3 | PoseC3D (w. HRNet 2D Skeleton) |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.9 | PoseC3D (w. HRNet 2D Skeleton) |
| 3D Action Recognition | Kinetics-Skeleton dataset | Accuracy | 49.1 | PoseC3D (SlowOnly-346) |
| 3D Action Recognition | Kinetics-Skeleton dataset | Accuracy | 47.7 | PoseC3D |
| 3D Action Recognition | NTU RGB+D | Accuracy (CS) | 94.1 | PoseC3D [3D Heatmap] |
| 3D Action Recognition | NTU RGB+D | Accuracy (CV) | 97.1 | PoseC3D [3D Heatmap] |
| 3D Action Recognition | NTU RGB+D | Ensembled Modalities | 2 | PoseC3D [3D Heatmap] |
| Action Recognition | Volleyball | Accuracy | 91.3 | PoseC3D (Pose Only) |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 97 | PoseC3D (RGB + Pose) |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 99.6 | PoseC3D (RGB + Pose) |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 96.4 | PoseC3D (RGB + Pose) |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 95.3 | PoseC3D (RGB + Pose) |
| Action Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 83.47 | RGBPoseConv3D |
| Action Recognition | Assembly101 | Actions Top-1 | 33.61 | RGBPoseConv3D |
| Action Recognition | Assembly101 | Object Top-1 | 42.9 | RGBPoseConv3D |
| Action Recognition | Assembly101 | Verbs Top-1 | 61.99 | RGBPoseConv3D |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 90.3 | PoseC3D (w. HRNet 2D Skeleton) |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.9 | PoseC3D (w. HRNet 2D Skeleton) |
| Action Recognition | Kinetics-Skeleton dataset | Accuracy | 49.1 | PoseC3D (SlowOnly-346) |
| Action Recognition | Kinetics-Skeleton dataset | Accuracy | 47.7 | PoseC3D |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 94.1 | PoseC3D [3D Heatmap] |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 97.1 | PoseC3D [3D Heatmap] |
| Action Recognition | NTU RGB+D | Ensembled Modalities | 2 | PoseC3D [3D Heatmap] |