Hengkai Guo, Guijin Wang, Xinghao Chen, Cairong Zhang, Fei Qiao, Huazhong Yang
Hand pose estimation from monocular depth images is an important and challenging problem for human-computer interaction. Recently, deep convolutional networks (ConvNets) with sophisticated designs have been employed to address it, but the improvement over traditional methods has not been substantial. To improve the performance of direct 3D coordinate regression, we propose a tree-structured Region Ensemble Network (REN), which partitions the convolution outputs into regions and integrates the results from multiple regressors on each region. Unlike a multi-model ensemble, our model can be trained fully end-to-end. Experimental results demonstrate that our approach achieves the best performance among state-of-the-art methods on two public datasets.
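The region-ensemble idea described above, partitioning the final convolutional feature map into a grid of regions and fusing per-region regressors, can be sketched as follows. This is a minimal NumPy illustration with made-up layer sizes and random weights, not the paper's actual architecture or trained model; the grid size, feature dimensions, and joint count are assumptions for demonstration only.

```python
import numpy as np

def region_ensemble(feature_map, num_joints=14, grid=2, fc_dim=64, seed=0):
    """Hypothetical sketch of the region-ensemble idea: split the last
    convolutional feature map into a grid of regions, regress from each
    region with its own fully connected branch, then fuse the branches
    into one 3D joint prediction. Sizes are illustrative only."""
    rng = np.random.default_rng(seed)
    c, h, w = feature_map.shape
    rh, rw = h // grid, w // grid
    branch_outputs = []
    for i in range(grid):
        for j in range(grid):
            # Crop one spatial region of the feature map
            region = feature_map[:, i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
            x = region.reshape(-1)                      # flatten the region
            W1 = rng.standard_normal((fc_dim, x.size)) * 0.01
            branch_outputs.append(np.maximum(W1 @ x, 0))  # FC + ReLU branch
    # Ensemble by concatenating all region branches before the final regressor
    fused = np.concatenate(branch_outputs)
    W2 = rng.standard_normal((num_joints * 3, fused.size)) * 0.01
    return (W2 @ fused).reshape(num_joints, 3)          # (x, y, z) per joint

pose = region_ensemble(np.zeros((32, 12, 12)))
print(pose.shape)  # (14, 3)
```

Fusing regions by concatenation (rather than averaging separate models) is what lets the whole tree of branches be trained jointly with a single loss, matching the end-to-end property claimed in the abstract.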
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Hand Pose Estimation | MSRA Hands | Average 3D Error (mm) | 9.8 | REN |
| Hand Pose Estimation | ICVL Hands | Average 3D Error (mm) | 7.5 | REN |
| Hand Pose Estimation | NYU Hands | Average 3D Error (mm) | 12.7 | REN |