Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai
Current methods for single-image depth estimation use training datasets with real image-depth pairs or stereo pairs, which are not easy to acquire. We propose a framework, trained on synthetic image-depth pairs and unpaired real images, that comprises an image translation network for enhancing realism of input images, followed by a depth prediction network. A key idea is having the first network act as a wide-spectrum input translator, taking in either synthetic or real images, and ideally producing minimally modified realistic images. This is done via a reconstruction loss when the training input is real, and GAN loss when synthetic, removing the need for heuristic self-regularization. The second network is trained on a task loss for synthetic image-depth pairs, with extra GAN loss to unify real and synthetic feature distributions. Importantly, the framework can be trained end-to-end, leading to good results, even surpassing early deep-learning methods that use real paired data.
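The training objective described above — a reconstruction loss on real inputs, a GAN loss on translated synthetic inputs, and a supervised task loss on the depth network — can be sketched with toy stand-in networks. This is a minimal illustration, not the authors' implementation; the network definitions, loss weights, and tensor shapes are placeholders (the paper's feature-level GAN loss is omitted for brevity).

```python
import torch
import torch.nn as nn

# Tiny stand-in modules; the actual T2Net uses deeper translator,
# depth-prediction, and discriminator architectures.
translator = nn.Conv2d(3, 3, 3, padding=1)   # G_T: synthetic/real image -> realistic image
depth_net  = nn.Conv2d(3, 1, 3, padding=1)   # G_D: image -> depth map
img_disc   = nn.Conv2d(3, 1, 3, padding=1)   # D_T: patch discriminator on images
l1 = nn.L1Loss()
bce = nn.BCEWithLogitsLoss()

x_syn  = torch.rand(2, 3, 16, 16)            # synthetic image
d_syn  = torch.rand(2, 1, 16, 16)            # its paired ground-truth depth
x_real = torch.rand(2, 3, 16, 16)            # unpaired real image

# 1) Translator on a real input: identity reconstruction loss,
#    so real images pass through minimally modified.
rec_loss = l1(translator(x_real), x_real)

# 2) Translator on a synthetic input: GAN loss pushes the translated
#    image toward the real-image distribution.
fake = translator(x_syn)
gan_loss = bce(img_disc(fake), torch.ones_like(img_disc(fake)))

# 3) Depth network: supervised task loss on the translated synthetic
#    image and its known depth.
task_loss = l1(depth_net(fake), d_syn)

# Weighted sum (weights here are illustrative), trained end-to-end:
# gradients flow from the task loss back through the translator.
total = task_loss + 0.01 * gan_loss + 10.0 * rec_loss
total.backward()
```

Because `fake` feeds both the GAN loss and the task loss, a single backward pass trains the translator and depth network jointly, which is the end-to-end property the abstract highlights.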
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Depth Estimation | DCM | Abs Rel | 0.351 | T2Net |
| Depth Estimation | DCM | RMSE | 1.117 | T2Net |
| Depth Estimation | DCM | RMSE log | 0.415 | T2Net |
| Depth Estimation | DCM | Sq Rel | 0.416 | T2Net |
| Depth Estimation | eBDtheque | Abs Rel | 0.491 | T2Net |
| Depth Estimation | eBDtheque | RMSE | 1.459 | T2Net |
| Depth Estimation | eBDtheque | RMSE log | 0.777 | T2Net |
| Depth Estimation | eBDtheque | Sq Rel | 0.555 | T2Net |
| Domain Adaptation | virtual KITTI to KITTI (MDE) | RMSE | 4.674 | T2Net |
| Domain Adaptation | virtual KITTI to KITTI (MDE) | RMSE | 4.674 | T2Net |