Ruixu Liu, Ju Shen, He Wang, Chen Chen, Sen-ching Cheung, Vijayan K. Asari
The attention mechanism provides a sequential prediction framework for learning spatial models with enhanced implicit temporal consistency. In this work, we present a systematic design (from 2D to 3D) showing how conventional networks and other forms of constraints can be incorporated into the attention framework to learn long-range dependencies for pose estimation. The contribution of this paper is a systematic approach to designing and training attention-based models for end-to-end pose estimation, with the flexibility and scalability to accept video sequences of arbitrary length as input. We achieve this by adapting the temporal receptive field via a multi-scale structure of dilated convolutions. In addition, the proposed architecture can be easily adapted into a causal model, enabling real-time performance. Any off-the-shelf 2D pose estimation system, e.g. Mocap libraries, can be easily integrated in an ad-hoc fashion. Our method achieves state-of-the-art performance, outperforming existing methods by reducing the mean per joint position error to 33.4 mm on the Human3.6M dataset.
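The temporal window sizes reported below (T=27 and T=243) are consistent with stacking dilated 1D temporal convolutions whose dilation rates grow geometrically. As a minimal sketch (assuming kernel size 3 and a base-3 dilation schedule, which are illustrative choices, not confirmed hyperparameters of this paper), the receptive field of such a stack can be computed as:

```python
def receptive_field(kernel_size, dilations):
    """Temporal receptive field (in frames) of stacked dilated 1D
    convolutions: each layer adds (kernel_size - 1) * dilation frames."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Assumed base-3 dilation schedules with kernel size 3:
print(receptive_field(3, [1, 3, 9]))          # 27 frames  (cf. T=27)
print(receptive_field(3, [1, 3, 9, 27, 81]))  # 243 frames (cf. T=243)
```

With this schedule, each additional layer triples the span of input frames a single output frame can attend to, so covering 243 frames requires only five convolutional layers rather than 121 layers of ordinary (dilation-1) convolutions.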
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| 3D Human Pose Estimation | HumanEva-I | Mean Reconstruction Error (mm) | 15.4 | Attention (T=27 MA) |
| 3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 44.8 | Attention (T=243 CPN) |