Rajeev Yasarla, Manish Kumar Singh, Hong Cai, Yunxiao Shi, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Risheek Garrepalli, Fatih Porikli
In this paper, we propose a novel video depth estimation approach, FutureDepth, which enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by making it learn to predict the future during training. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to iteratively predict multi-frame features one time step ahead. In this way, F-Net learns the underlying motion and correspondence information, and we incorporate its features into the depth decoding process. Additionally, to enrich the learning of multi-frame correspondence cues, we further leverage a reconstruction network, R-Net, which is trained via adaptively masked auto-encoding of multi-frame feature volumes. At inference time, both F-Net and R-Net are used to produce queries that work with the depth decoder as well as a final refinement network. Through extensive experiments on several benchmarks, i.e., NYUDv2, KITTI, DDAD, and Sintel, which cover indoor, driving, and open-domain scenarios, we show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy. Furthermore, FutureDepth is more efficient than existing SOTA video depth estimation models and has latencies similar to those of monocular models.
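
To make the idea of predicting multi-frame features "one time step ahead iteratively" concrete, the following is a minimal sketch of such a rollout loop. It is not the authors' F-Net: the names `rollout` and `toy_predict_step` are hypothetical, and the learned network is replaced by a simple linear extrapolation so that the example runs on its own.

```python
import numpy as np

def toy_predict_step(window):
    """Hypothetical stand-in for a learned one-step-ahead predictor.

    `window` has shape (T, C): features of T consecutive frames.
    Returns a window of the same shape, shifted one time step ahead.
    The newest frame is linearly extrapolated (an assumption made only
    for illustration; F-Net is a trained network, not an extrapolator).
    """
    next_frame = 2.0 * window[-1] - window[-2]
    return np.concatenate([window[1:], next_frame[None]], axis=0)

def rollout(predict_step, window, n_steps):
    """Apply a one-step-ahead predictor iteratively for n_steps."""
    outputs = []
    for _ in range(n_steps):
        window = predict_step(window)  # feed each prediction back in
        outputs.append(window)
    return outputs

# Toy features that grow linearly over time: frame t has value t in every channel.
features = np.arange(3, dtype=np.float64)[:, None] * np.ones((1, 4))
predictions = rollout(toy_predict_step, features, n_steps=2)
```

In a training setting, each predicted window would presumably be compared against the features of the actual later frames; the loss and network architecture are not specified in the text above and are left out here.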
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Depth Estimation | NYU-Depth V2 | Delta < 1.25 | 0.981 | FutureDepth |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^2 | 0.996 | FutureDepth |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^3 | 0.999 | FutureDepth |
| Depth Estimation | NYU-Depth V2 | RMSE | 0.233 | FutureDepth |
| Depth Estimation | NYU-Depth V2 | absolute relative error | 0.063 | FutureDepth |
| Depth Estimation | NYU-Depth V2 | log 10 | 0.027 | FutureDepth |
| Depth Estimation | KITTI Eigen split | Delta < 1.25 | 0.984 | FutureDepth |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^2 | 0.998 | FutureDepth |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^3 | 1 | FutureDepth |
| Depth Estimation | KITTI Eigen split | RMSE | 1.856 | FutureDepth |
| Depth Estimation | KITTI Eigen split | RMSE log | 0.066 | FutureDepth |
| Depth Estimation | KITTI Eigen split | Square relative error (Sq Rel) | 0.117 | FutureDepth |
| Depth Estimation | KITTI Eigen split | absolute relative error | 0.041 | FutureDepth |
| 3D | NYU-Depth V2 | Delta < 1.25 | 0.981 | FutureDepth |
| 3D | NYU-Depth V2 | Delta < 1.25^2 | 0.996 | FutureDepth |
| 3D | NYU-Depth V2 | Delta < 1.25^3 | 0.999 | FutureDepth |
| 3D | NYU-Depth V2 | RMSE | 0.233 | FutureDepth |
| 3D | NYU-Depth V2 | absolute relative error | 0.063 | FutureDepth |
| 3D | NYU-Depth V2 | log 10 | 0.027 | FutureDepth |
| 3D | KITTI Eigen split | Delta < 1.25 | 0.984 | FutureDepth |
| 3D | KITTI Eigen split | Delta < 1.25^2 | 0.998 | FutureDepth |
| 3D | KITTI Eigen split | Delta < 1.25^3 | 1 | FutureDepth |
| 3D | KITTI Eigen split | RMSE | 1.856 | FutureDepth |
| 3D | KITTI Eigen split | RMSE log | 0.066 | FutureDepth |
| 3D | KITTI Eigen split | Square relative error (Sq Rel) | 0.117 | FutureDepth |
| 3D | KITTI Eigen split | absolute relative error | 0.041 | FutureDepth |
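
The metrics in the table follow the definitions commonly used in depth estimation benchmarks. As a rough guide to how such numbers are typically computed, here is a minimal sketch; the function name `depth_metrics` and the validity rule (ground-truth depth greater than zero) are assumptions for illustration, and the exact evaluation protocol (depth caps, crops, alignment) used for the reported values is not given here.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Compute standard depth-estimation metrics.

    Assumes `pred` is strictly positive and treats pixels with gt <= 0
    as missing ground truth (a common convention, assumed here).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]

    # Threshold accuracy: share of pixels with max(pred/gt, gt/pred) < 1.25^k.
    ratio = np.maximum(pred / gt, gt / pred)
    diff = pred - gt
    return {
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
        "abs_rel": float(np.mean(np.abs(diff) / gt)),
        "sq_rel": float(np.mean(diff ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "log10": float(np.mean(np.abs(np.log10(pred) - np.log10(gt)))),
    }

# Tiny example: two valid pixels, each off by a factor of 2.
example = depth_metrics(pred=[2.0, 2.0], gt=[1.0, 4.0])
```

Higher is better for the three delta thresholds, and lower is better for the error metrics.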