Vítor Albiero, Xingyu Chen, Xi Yin, Guan Pang, Tal Hassner
We propose real-time, six degrees of freedom (6DoF), 3D face pose estimation without face detection or landmark localization. We observe that estimating the 6DoF rigid transformation of a face is a simpler problem than facial landmark detection, which is often used for 3D face alignment. In addition, 6DoF pose offers more information than face bounding box labels. We leverage these observations to make multiple contributions: (a) We describe an easily trained, efficient, Faster R-CNN-based model which regresses 6DoF pose for all faces in the photo, without preliminary face detection. (b) We explain how pose is converted and kept consistent between the input photo and arbitrary crops created while training and evaluating our model. (c) Finally, we show how face poses can replace detection bounding box training labels. Tests on AFLW2000-3D and BIWI show that our method runs in real time and outperforms state-of-the-art (SotA) face pose estimators. Remarkably, our method also surpasses SotA models of comparable complexity on the WIDER FACE detection benchmark, despite not being optimized on bounding box labels.
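
To make the notion of a 6DoF face pose concrete, the following is a minimal sketch of how a six-value pose can place a rigid 3D face shape in an image. It assumes the pose is stored as a three-value axis-angle rotation vector followed by a three-value translation, and that the camera is a simple pinhole model. The function names `rotvec_to_matrix` and `project_points`, the pose layout, and the intrinsics matrix are illustrative assumptions, not the authors' code.

```python
import numpy as np


def rotvec_to_matrix(rotvec):
    """Rodrigues' formula: axis-angle vector (radians) -> 3x3 rotation matrix."""
    rotvec = np.asarray(rotvec, dtype=float)
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        # No rotation: return the identity to avoid dividing by zero.
        return np.eye(3)
    k = rotvec / theta  # unit rotation axis
    # Skew-symmetric cross-product matrix of the axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def project_points(points_3d, pose_6dof, intrinsics):
    """Apply a 6DoF pose to N x 3 points and project them with a pinhole camera.

    pose_6dof is assumed to be (rx, ry, rz, tx, ty, tz): rotation vector, then
    translation. This layout is an assumption made for illustration.
    """
    pose = np.asarray(pose_6dof, dtype=float)
    R = rotvec_to_matrix(pose[:3])
    t = pose[3:]
    cam = np.asarray(points_3d, dtype=float) @ R.T + t  # camera coordinates
    uvw = cam @ np.asarray(intrinsics, dtype=float).T   # homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]                     # divide by depth


# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
intrinsics = np.array([[500.0, 0.0, 320.0],
                       [0.0, 500.0, 240.0],
                       [0.0, 0.0, 1.0]])
# A point at the face origin, seen with no rotation at a depth of 2 units.
pixels = project_points(np.zeros((1, 3)), [0, 0, 0, 0, 0, 2], intrinsics)
```

With zero rotation, the origin of the face shape lands on the principal point, which is a quick sanity check for the projection. Because the pose fixes where every point of the face shape projects, it also implies an image region containing the face, which is one way to read the claim that 6DoF pose carries more information than a bounding box.
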
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Facial Recognition and Modelling | WIDER Face (Medium) | AP | 0.89 | img2pose |
| Facial Recognition and Modelling | WIDER Face (Easy) | AP | 0.9 | img2pose |
| Facial Recognition and Modelling | WIDER Face (Hard) | AP | 0.839 | img2pose |
| Pose Estimation | AFLW2000 | Geodesic Error (GE) | 6.41 | img2pose |
| Pose Estimation | AFLW2000 | MAE | 3.913 | img2pose |
| Pose Estimation | AFLW2000 | MAE_t | 0.099 | img2pose |
| Pose Estimation | AFLW2000 | MAE | 4.839 | RetinaFace R-50 (5 points) |
| Pose Estimation | AFLW2000 | MAE_t | 0.114 | RetinaFace R-50 (5 points) |
| Pose Estimation | BIWI | Geodesic Error (GE) | 7.1 | img2pose |
| Pose Estimation | BIWI | Geodesic Error - aligned (GE) | 6.23 | img2pose |
| Pose Estimation | BIWI | MAE (trained with other data) | 3.786 | img2pose |
| Pose Estimation | BIWI | MAE-aligned (trained with other data) | 3.4 | img2pose |
| Pose Estimation | BIWI | MAE (trained with other data) | 4.578 | RetinaFace R-50 (5 points) |
| Face Detection | WIDER Face (Medium) | AP | 0.89 | img2pose |
| Face Detection | WIDER Face (Easy) | AP | 0.9 | img2pose |
| Face Detection | WIDER Face (Hard) | AP | 0.839 | img2pose |
| Face Reconstruction | WIDER Face (Medium) | AP | 0.89 | img2pose |
| Face Reconstruction | WIDER Face (Easy) | AP | 0.9 | img2pose |
| Face Reconstruction | WIDER Face (Hard) | AP | 0.839 | img2pose |
| 3D | AFLW2000 | Geodesic Error (GE) | 6.41 | img2pose |
| 3D | AFLW2000 | MAE | 3.913 | img2pose |
| 3D | AFLW2000 | MAE_t | 0.099 | img2pose |
| 3D | AFLW2000 | MAE | 4.839 | RetinaFace R-50 (5 points) |
| 3D | AFLW2000 | MAE_t | 0.114 | RetinaFace R-50 (5 points) |
| 3D | BIWI | Geodesic Error (GE) | 7.1 | img2pose |
| 3D | BIWI | Geodesic Error - aligned (GE) | 6.23 | img2pose |
| 3D | BIWI | MAE (trained with other data) | 3.786 | img2pose |
| 3D | BIWI | MAE-aligned (trained with other data) | 3.4 | img2pose |
| 3D | BIWI | MAE (trained with other data) | 4.578 | RetinaFace R-50 (5 points) |
| 3D | WIDER Face (Medium) | AP | 0.89 | img2pose |
| 3D | WIDER Face (Easy) | AP | 0.9 | img2pose |
| 3D | WIDER Face (Hard) | AP | 0.839 | img2pose |
| 3D Face Modelling | WIDER Face (Medium) | AP | 0.89 | img2pose |
| 3D Face Modelling | WIDER Face (Easy) | AP | 0.9 | img2pose |
| 3D Face Modelling | WIDER Face (Hard) | AP | 0.839 | img2pose |
| 3D Face Reconstruction | WIDER Face (Medium) | AP | 0.89 | img2pose |
| 3D Face Reconstruction | WIDER Face (Easy) | AP | 0.9 | img2pose |
| 3D Face Reconstruction | WIDER Face (Hard) | AP | 0.839 | img2pose |
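
The pose metrics in the table can be illustrated with a short sketch. It assumes that geodesic error is the angle of the relative rotation between the predicted and ground-truth rotation matrices, reported in degrees, and that MAE is a plain mean of absolute differences over pose components. The exact benchmark protocol (angle convention, alignment step, units of MAE_t) is not stated here, so the helper names and conventions below are illustrative only.

```python
import numpy as np


def rot_z(degrees):
    """Rotation matrix for a rotation about the z axis (helper for the demo)."""
    a = np.radians(degrees)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])


def geodesic_error_deg(R_pred, R_true):
    """Angle, in degrees, of the rotation taking R_pred to R_true."""
    cos_angle = (np.trace(R_pred.T @ R_true) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def mean_absolute_error(pred, true):
    """Mean absolute difference over all components (no angle wrap-around)."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.mean(np.abs(pred - true)))


# Two rotations that differ by 30 degrees about the same axis.
ge = geodesic_error_deg(rot_z(10.0), rot_z(40.0))
# Hypothetical per-angle predictions and labels, in degrees.
mae = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```

Unlike a per-angle MAE, the geodesic error does not depend on the choice of Euler angle convention, which may be why the table reports both.
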