Clément Godard, Oisin Mac Aodha, Michael Firman, Gabriel Brostow
Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
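
Contributions (i) and (iii) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a plain L1 photometric error, flattened grayscale images stored as lists of floats, and source frames that have already been warped into the target view. The names `l1_error` and `min_reprojection_loss` are hypothetical, and averaging over only the unmasked pixels is an assumption made for this illustration.

```python
from typing import List, Sequence, Tuple


def l1_error(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Per-pixel absolute difference between two flattened images."""
    return [abs(x - y) for x, y in zip(a, b)]


def min_reprojection_loss(
    target: Sequence[float],
    warped_sources: Sequence[Sequence[float]],
    raw_sources: Sequence[Sequence[float]],
) -> Tuple[float, List[bool]]:
    """Return (loss, mask) for one target frame.

    warped_sources: source frames reprojected into the target view
                    (assumed to be computed elsewhere from depth and pose).
    raw_sources:    the same source frames without any warping.
    """
    # Error of every warped source against the target, pixel by pixel.
    reproj = [l1_error(target, w) for w in warped_sources]
    # Error of every unwarped source against the target, pixel by pixel.
    identity = [l1_error(target, s) for s in raw_sources]

    # Minimum over source frames: a pixel occluded in one source frame
    # can still be explained by another frame where it is visible.
    min_reproj = [min(errs) for errs in zip(*reproj)]
    min_identity = [min(errs) for errs in zip(*identity)]

    # Auto-mask: keep a pixel only if warping explains it better than
    # leaving the source untouched. Pixels that look the same across
    # frames (static camera, objects moving with the camera) are dropped.
    mask = [r < i for r, i in zip(min_reproj, min_identity)]

    kept = [r for r, keep in zip(min_reproj, mask) if keep]
    # Assumption of this sketch: average over the unmasked pixels only.
    loss = sum(kept) / len(kept) if kept else 0.0
    return loss, mask
```

Taking the per-pixel minimum instead of the average means that a pixel visible in only one source frame is not penalised for the frame in which it is occluded. The mask comparison needs no extra network or learned parameters, since it reuses the same error function on unwarped frames.
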
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Depth Estimation | KITTI Eigen split | Abs Rel | 0.106 | Monodepth2 M |
| Depth Estimation | Mid-Air Dataset | Abs Rel | 0.717 | Monodepth2 |
| Depth Estimation | Mid-Air Dataset | RMSE | 74.552 | Monodepth2 |
| Depth Estimation | Mid-Air Dataset | RMSE log | 0.882 | Monodepth2 |
| Depth Estimation | Mid-Air Dataset | Sq Rel | 37.164 | Monodepth2 |
| Depth Estimation | VA (Virtual Apartment) | Abs Rel | 0.203 | Monodepth2 |
| Depth Estimation | VA (Virtual Apartment) | RMSE log | 0.251 | Monodepth2 |
| Depth Estimation | VA (Virtual Apartment) | MAE | 0.295 | Monodepth2 |
| Depth Estimation | VA (Virtual Apartment) | RMSE | 0.432 | Monodepth2 |
| Depth Estimation | Make3D | Abs Rel | 0.322 | Monodepth2 |
| Depth Estimation | Make3D | RMSE | 7.417 | Monodepth2 |
| Depth Estimation | Make3D | Sq Rel | 3.589 | Monodepth2 |
| 3D | KITTI Eigen split | Abs Rel | 0.106 | Monodepth2 M |
| 3D | Mid-Air Dataset | Abs Rel | 0.717 | Monodepth2 |
| 3D | Mid-Air Dataset | RMSE | 74.552 | Monodepth2 |
| 3D | Mid-Air Dataset | RMSE log | 0.882 | Monodepth2 |
| 3D | Mid-Air Dataset | Sq Rel | 37.164 | Monodepth2 |
| 3D | VA (Virtual Apartment) | Abs Rel | 0.203 | Monodepth2 |
| 3D | VA (Virtual Apartment) | RMSE log | 0.251 | Monodepth2 |
| 3D | VA (Virtual Apartment) | MAE | 0.295 | Monodepth2 |
| 3D | VA (Virtual Apartment) | RMSE | 0.432 | Monodepth2 |
| 3D | Make3D | Abs Rel | 0.322 | Monodepth2 |
| 3D | Make3D | RMSE | 7.417 | Monodepth2 |
| 3D | Make3D | Sq Rel | 3.589 | Monodepth2 |
| Camera Pose Estimation | KITTI Odometry Benchmark | Absolute Trajectory Error [m] | 93.04 | Monodepth2 |
| Camera Pose Estimation | KITTI Odometry Benchmark | Average Rotational Error er[%] | 20.72 | Monodepth2 |
| Camera Pose Estimation | KITTI Odometry Benchmark | Average Translational Error et[%] | 43.21 | Monodepth2 |
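
For readers unfamiliar with the depth error metrics in the table (absolute relative error, squared relative error, RMSE, and log RMSE), the sketch below computes them using their commonly used definitions. It is an assumption that every benchmark listed here follows exactly these definitions, with the same depth caps, crops, and scaling. The function name `depth_metrics` is hypothetical.

```python
import math
from typing import Dict, Sequence


def depth_metrics(pred: Sequence[float], gt: Sequence[float]) -> Dict[str, float]:
    """Standard depth errors over paired predicted / ground-truth depths.

    Both sequences must be the same length and contain strictly positive
    depths (pixels without valid ground truth are assumed to be removed).
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and equally long")
    if any(p <= 0 for p in pred) or any(g <= 0 for g in gt):
        raise ValueError("depths must be strictly positive")

    n = len(pred)
    # Abs Rel: mean of |d - d*| / d*
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    # Sq Rel: mean of (d - d*)^2 / d*
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    # RMSE: square root of the mean squared error
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    # RMSE log: RMSE computed on log-depths
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in zip(pred, gt)) / n
    )
    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
    }
```

All four are errors, so lower is better. Abs Rel and RMSE log do not depend on the scene's depth scale, whereas Sq Rel and RMSE are in depth units and grow with it, which is one reason values are not directly comparable across datasets.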