Yi Feng, Zizhan Guo, Qijun Chen, Rui Fan
Unsupervised monocular depth estimation frameworks have shown promising performance in autonomous driving. However, existing solutions primarily rely on a simple convolutional neural network for ego-motion recovery, which struggles to estimate precise camera poses in dynamic, complicated real-world scenarios. These inaccurately estimated camera poses inevitably degrade the photometric reconstruction and mislead the depth estimation networks with wrong supervisory signals. In this article, we introduce SCIPaD, a novel approach that incorporates spatial clues for unsupervised depth-pose joint learning. Specifically, a confidence-aware feature flow estimator is proposed to acquire 2D feature positional translations and their associated confidence levels. Meanwhile, we introduce a positional clue aggregator, which integrates pseudo 3D point clouds from DepthNet and 2D feature flows into homogeneous positional representations. Finally, a hierarchical positional embedding injector is proposed to selectively inject spatial clues into semantic features for robust camera pose decoding. Extensive experiments and analyses demonstrate the superior performance of our model compared to other state-of-the-art methods. Remarkably, SCIPaD achieves a reduction of 22.2% in average translation error and 34.8% in average angular error for the camera pose estimation task on the KITTI Odometry dataset. Our source code is available at https://mias.group/SCIPaD.
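
To make the link between depth, camera pose, and 2D positional translations concrete, the following is a minimal sketch of the standard pinhole geometry that underlies photometric reconstruction: a pixel is lifted to a pseudo 3D point using its depth, moved by a rigid camera motion, and projected back to the image. It is not the authors' implementation. The function names (`backproject`, `transform`, `project`, `rigid_flow`) and the intrinsics used in the example are assumptions chosen for illustration.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with a predicted depth to a 3D point in the camera frame."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(point: Vec3, rotation: Sequence[Sequence[float]],
              translation: Sequence[float]) -> Vec3:
    """Apply the rigid motion R @ p + t, with R given as three rows."""
    x, y, z = (
        sum(r_ij * p_j for r_ij, p_j in zip(row, point)) + t_i
        for row, t_i in zip(rotation, translation)
    )
    return (x, y, z)


def project(point: Vec3, fx: float, fy: float,
            cx: float, cy: float) -> Tuple[float, float]:
    """Project a 3D camera-frame point to pixel coordinates (pinhole model)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def rigid_flow(u: float, v: float, depth: float,
               rotation: Sequence[Sequence[float]],
               translation: Sequence[float],
               fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """2D displacement of a static scene point induced by depth and camera motion."""
    moved = transform(backproject(u, v, depth, fx, fy, cx, cy), rotation, translation)
    u2, v2 = project(moved, fx, fy, cx, cy)
    return (u2 - u, v2 - v)


# Hypothetical intrinsics and motion, chosen only to keep the arithmetic simple.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
flow = rigid_flow(50.0, 40.0, 10.0, IDENTITY, (1.0, 0.0, 0.0),
                  fx=100.0, fy=100.0, cx=50.0, cy=40.0)
```

In this example a point 10 m in front of the camera, shifted 1 m sideways in the camera frame, moves 10 pixels horizontally. An error in the estimated pose changes this displacement directly, which is why inaccurate poses corrupt the photometric supervision of the depth network.
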
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Depth Estimation | KITTI Eigen split unsupervised | Delta < 1.25 | 0.918 | SCIPaD |
| Depth Estimation | KITTI Eigen split unsupervised | Delta < 1.25^2 | 0.97 | SCIPaD |
| Depth Estimation | KITTI Eigen split unsupervised | Delta < 1.25^3 | 0.985 | SCIPaD |
| Depth Estimation | KITTI Eigen split unsupervised | RMSE | 4.056 | SCIPaD |
| Depth Estimation | KITTI Eigen split unsupervised | RMSE log | 0.166 | SCIPaD |
| Depth Estimation | KITTI Eigen split unsupervised | Sq Rel | 0.65 | SCIPaD |
| Depth Estimation | KITTI Eigen split unsupervised | Abs Rel | 0.09 | SCIPaD |
| Depth Estimation | KITTI Eigen split unsupervised | Delta < 1.25 | 0.897 | SCIPaD(M+640x192) |
| Depth Estimation | KITTI Eigen split unsupervised | Delta < 1.25^2 | 0.964 | SCIPaD(M+640x192) |
| Depth Estimation | KITTI Eigen split unsupervised | Delta < 1.25^3 | 0.983 | SCIPaD(M+640x192) |
| Depth Estimation | KITTI Eigen split unsupervised | RMSE | 4.391 | SCIPaD(M+640x192) |
| Depth Estimation | KITTI Eigen split unsupervised | RMSE log | 0.175 | SCIPaD(M+640x192) |
| Depth Estimation | KITTI Eigen split unsupervised | Sq Rel | 0.732 | SCIPaD(M+640x192) |
| Depth Estimation | KITTI Eigen split unsupervised | Abs Rel | 0.098 | SCIPaD(M+640x192) |
| Camera Pose Estimation | KITTI Odometry Benchmark | Absolute Trajectory Error [m] | 20.83 | SCIPaD |
| Camera Pose Estimation | KITTI Odometry Benchmark | Average Rotational Error er[%] | 3.17 | SCIPaD |
| Camera Pose Estimation | KITTI Odometry Benchmark | Average Translational Error et[%] | 8.63 | SCIPaD |
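
For readers unfamiliar with the depth rows above, the following is a minimal sketch of how these metrics are conventionally defined in the monocular depth literature (absolute relative error, squared relative error, RMSE, RMSE log, and the threshold accuracies Delta < 1.25^k). It assumes flat lists of strictly positive predicted and ground-truth depths in meters. The function name `depth_metrics` and the dictionary keys are illustrative, and the evaluation code released with the paper may differ in details such as depth capping, cropping, and median scaling.

```python
import math
from typing import Dict, Sequence


def depth_metrics(pred: Sequence[float], gt: Sequence[float]) -> Dict[str, float]:
    """Compute conventional depth-evaluation metrics over paired depth values."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    if any(p <= 0 or g <= 0 for p, g in zip(pred, gt)):
        raise ValueError("depths must be strictly positive")

    n = len(gt)
    pairs = list(zip(pred, gt))

    # Error metrics: lower is better.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)

    # Accuracy metrics: fraction of pixels whose ratio max(p/g, g/p) is
    # below 1.25, 1.25^2, and 1.25^3. Higher is better.
    ratios = [max(p / g, g / p) for p, g in pairs]
    deltas = {
        f"delta_{k}": sum(r < 1.25 ** k for r in ratios) / n
        for k in (1, 2, 3)
    }

    return {"abs_rel": abs_rel, "sq_rel": sq_rel,
            "rmse": rmse, "rmse_log": rmse_log, **deltas}


# Toy example with two pixels; the values are made up for illustration.
example = depth_metrics(pred=[2.0, 4.0], gt=[2.0, 5.0])
```

The error metrics and the threshold accuracies move in opposite directions, which is why the table reports higher Delta values alongside lower RMSE and relative errors for the stronger SCIPaD configuration.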