Anshul Shah, Shlok Mishra, Ankan Bansal, Jun-Cheng Chen, Rama Chellappa, Abhinav Shrivastava
Recent progress on action recognition has mainly focused on RGB and optical flow features. In this paper, we approach the problem of joint-based action recognition. Unlike other modalities, the constellation of joints and their motion yields a succinct representation of human motion for activity recognition. We present a new model for joint-based action recognition, which first extracts motion features from each joint separately through a shared motion encoder before performing collective reasoning. Our joint selector module re-weights the joint information to select the most discriminative joints for the task. We also propose a novel joint-contrastive loss that pulls together groups of joint features which convey the same action. We strengthen the joint-based representations by using a geometry-aware data augmentation technique which jitters pose heatmaps while retaining the dynamics of the action. We show large improvements over the current state-of-the-art joint-based approaches on the JHMDB, HMDB, Charades, and AVA action recognition datasets. A late fusion with RGB and Flow-based approaches yields additional improvements. Our model also outperforms the existing baseline on Mimetics, a dataset with out-of-context actions.
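The per-joint encoding and joint re-weighting described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear encoder with `tanh`, the softmax gating, the weight matrices `w_enc` and `w_sel`, and all sizes are assumptions chosen only to show the data flow (one encoder shared across joints, then a per-joint weight applied before pooling).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_joints(joint_motion, w_enc):
    """Apply the same (shared) encoder to every joint independently.

    joint_motion: (num_joints, in_dim) per-joint motion descriptors.
    w_enc:        (in_dim, feat_dim) shared weights (hypothetical linear encoder).
    """
    return np.tanh(joint_motion @ w_enc)

def select_joints(joint_feats, w_sel):
    """Score each joint, normalize the scores, and re-weight the joint features.

    Softmax gating is an assumption; the source only says the selector
    "re-weights the joint information".
    """
    scores = joint_feats @ w_sel          # (num_joints,)
    weights = softmax(scores)             # non-negative, sums to 1
    return joint_feats * weights[:, None], weights

rng = np.random.default_rng(0)
num_joints, in_dim, feat_dim = 15, 32, 8  # arbitrary illustrative sizes
x = rng.normal(size=(num_joints, in_dim))
w_enc = rng.normal(size=(in_dim, feat_dim)) * 0.1
w_sel = rng.normal(size=feat_dim)

feats = encode_joints(x, w_enc)            # one feature vector per joint
weighted, weights = select_joints(feats, w_sel)
pooled = weighted.sum(axis=0)              # collective representation for a classifier
```

Because the encoder weights are shared, the parameter count does not grow with the number of joints; only the selector decides how much each joint contributes to the pooled feature.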
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | JHMDB (2D poses only) | Average accuracy of 3 splits | 68.55 | JMRN (No GT pose) |
| Video | Charades | mAP | 43.23 | JMRN + R101-NL-LFB |
| Video | Charades | mAP | 16.2 | JMRN (Pose only) |
| Temporal Action Localization | JHMDB (2D poses only) | Average accuracy of 3 splits | 68.55 | JMRN (No GT pose) |
| Zero-Shot Learning | JHMDB (2D poses only) | Average accuracy of 3 splits | 68.55 | JMRN (No GT pose) |
| Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 84.53 | Ours + ResNeXt101 BERT |
| Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 54.2 | JMRN |
| Activity Recognition | AVA v2.1 | mAP (Val) | 28.4 | JMRN + SlowFast-R101-NL |
| Activity Recognition | Mimetics | mAP | 40 | JMRN |
| Activity Recognition | Mimetics | mAP | 38.3 | SIP-Net |
| Activity Recognition | JHMDB (2D poses only) | Average accuracy of 3 splits | 68.55 | JMRN (No GT pose) |
| Action Localization | JHMDB (2D poses only) | Average accuracy of 3 splits | 68.55 | JMRN (No GT pose) |
| Action Detection | JHMDB (2D poses only) | Average accuracy of 3 splits | 68.55 | JMRN (No GT pose) |
| 3D Action Recognition | JHMDB (2D poses only) | Average accuracy of 3 splits | 68.55 | JMRN (No GT pose) |
| Action Recognition | HMDB-51 | Average accuracy of 3 splits | 84.53 | Ours + ResNeXt101 BERT |
| Action Recognition | HMDB-51 | Average accuracy of 3 splits | 54.2 | JMRN |
| Action Recognition | AVA v2.1 | mAP (Val) | 28.4 | JMRN + SlowFast-R101-NL |
| Action Recognition | Mimetics | mAP | 40 | JMRN |
| Action Recognition | Mimetics | mAP | 38.3 | SIP-Net |
| Action Recognition | JHMDB (2D poses only) | Average accuracy of 3 splits | 68.55 | JMRN (No GT pose) |
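
Several rows above pair JMRN with an RGB model, which the abstract describes as late fusion. One common way to do this is a weighted average of per-class scores from the two models; the sketch below assumes that scheme. The function name `late_fusion`, the `pose_weight` parameter, and the toy scores are hypothetical and are not taken from the paper.

```python
import numpy as np

def late_fusion(pose_probs, rgb_probs, pose_weight=0.5):
    """Blend per-class probabilities from a pose model and an RGB/flow model.

    pose_probs, rgb_probs: arrays of shape (num_clips, num_classes).
    pose_weight: weight on the pose stream (assumed to be tuned on validation data).
    """
    pose_probs = np.asarray(pose_probs, dtype=float)
    rgb_probs = np.asarray(rgb_probs, dtype=float)
    return pose_weight * pose_probs + (1.0 - pose_weight) * rgb_probs

# Toy scores for two clips and two classes.
pose = np.array([[0.2, 0.8], [0.6, 0.4]])
rgb = np.array([[0.4, 0.6], [0.1, 0.9]])

fused = late_fusion(pose, rgb, pose_weight=0.25)
pred = fused.argmax(axis=1)  # predicted class index per clip
```

Fusing at the score level keeps the two models independent, so the pose stream can be added to an existing RGB pipeline without retraining it.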