Mohammadreza Zolfaghari, Gabriel L. Oliveira, Nima Sedaghat, Thomas Brox
General human action recognition requires an understanding of various visual cues. In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. For the integration, we introduce a Markov chain model that adds cues successively. The resulting approach is efficient and applicable to action classification as well as to spatial and temporal action localization. Both contributions, the multi-cue architecture and the Markov chain integration, improve performance over their respective baselines. The overall approach achieves state-of-the-art action classification performance on the HMDB51, J-HMDB, and NTU RGB+D datasets. Moreover, it yields state-of-the-art spatio-temporal action localization results on UCF101 and J-HMDB.
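To make the idea of adding cues successively more concrete, the following is a minimal sketch of a chained classifier, not the authors' implementation. It assumes that each stage receives its own cue features together with the class posterior of the previous stage, and that the cues are processed in the order they are listed in the abstract (pose, motion, raw images). The helper names (`make_stage`, `run_stage`, `chained_predict`), the feature dimension, the number of classes, and the random linear layers standing in for trained streams are all hypothetical.

```python
import math
import random


def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_stage(in_dim, num_classes, rng):
    # Hypothetical stand-in for a trained stream: a random linear layer.
    return [[rng.uniform(-1, 1) for _ in range(in_dim)]
            for _ in range(num_classes)]


def run_stage(weights, cue_features, prev_probs):
    # Assumption: a stage sees its own cue plus the previous stage's posterior.
    x = list(cue_features) + list(prev_probs)
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return softmax(scores)


def chained_predict(stages, cues, num_classes):
    # Start from a uniform prior and refine it one cue at a time.
    probs = [1.0 / num_classes] * num_classes
    history = []
    for weights, features in zip(stages, cues):
        probs = run_stage(weights, features, probs)
        history.append(probs)
    return history


rng = random.Random(0)
num_classes = 4   # illustrative value, not taken from any dataset
feat_dim = 8      # illustrative feature size per cue
cue_names = ["pose", "motion", "rgb"]

stages = [make_stage(feat_dim + num_classes, num_classes, rng)
          for _ in cue_names]
cues = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in cue_names]

history = chained_predict(stages, cues, num_classes)
for name, probs in zip(cue_names, history):
    print(name, [round(p, 3) for p in probs])
```

Each entry of `history` is the class posterior after one more cue has been added, so the last entry plays the role of the final prediction. Because every stage depends only on the output of the stage before it and on its own cue, the sequence of predictions has the chain structure the abstract alludes to.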
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Video | JHMDB (2D poses only) | Average accuracy of 3 splits | 56.8 | Chained |
| Video | J-HMDB | Accuracy (RGB+pose) | 76.1 | Chained (RGB+Flow+Pose) |
| Video | J-HMDB | Accuracy (pose) | 56.8 | Chained (RGB+Flow+Pose) |
| Temporal Action Localization | JHMDB (2D poses only) | Average accuracy of 3 splits | 56.8 | Chained |
| Temporal Action Localization | J-HMDB | Accuracy (RGB+pose) | 76.1 | Chained (RGB+Flow+Pose) |
| Temporal Action Localization | J-HMDB | Accuracy (pose) | 56.8 | Chained (RGB+Flow+Pose) |
| Zero-Shot Learning | JHMDB (2D poses only) | Average accuracy of 3 splits | 56.8 | Chained |
| Zero-Shot Learning | J-HMDB | Accuracy (RGB+pose) | 76.1 | Chained (RGB+Flow+Pose) |
| Zero-Shot Learning | J-HMDB | Accuracy (pose) | 56.8 | Chained (RGB+Flow+Pose) |
| Activity Recognition | JHMDB (2D poses only) | Average accuracy of 3 splits | 56.8 | Chained |
| Activity Recognition | J-HMDB | Accuracy (RGB+pose) | 76.1 | Chained (RGB+Flow+Pose) |
| Activity Recognition | J-HMDB | Accuracy (pose) | 56.8 | Chained (RGB+Flow+Pose) |
| Action Localization | JHMDB (2D poses only) | Average accuracy of 3 splits | 56.8 | Chained |
| Action Localization | J-HMDB | Accuracy (RGB+pose) | 76.1 | Chained (RGB+Flow+Pose) |
| Action Localization | J-HMDB | Accuracy (pose) | 56.8 | Chained (RGB+Flow+Pose) |
| Action Detection | JHMDB (2D poses only) | Average accuracy of 3 splits | 56.8 | Chained |
| Action Detection | J-HMDB | Accuracy (RGB+pose) | 76.1 | Chained (RGB+Flow+Pose) |
| Action Detection | J-HMDB | Accuracy (pose) | 56.8 | Chained (RGB+Flow+Pose) |
| 3D Action Recognition | JHMDB (2D poses only) | Average accuracy of 3 splits | 56.8 | Chained |
| 3D Action Recognition | J-HMDB | Accuracy (RGB+pose) | 76.1 | Chained (RGB+Flow+Pose) |
| 3D Action Recognition | J-HMDB | Accuracy (pose) | 56.8 | Chained (RGB+Flow+Pose) |
| Action Recognition | JHMDB (2D poses only) | Average accuracy of 3 splits | 56.8 | Chained |
| Action Recognition | J-HMDB | Accuracy (RGB+pose) | 76.1 | Chained (RGB+Flow+Pose) |
| Action Recognition | J-HMDB | Accuracy (pose) | 56.8 | Chained (RGB+Flow+Pose) |