Deep Optics for Monocular Depth Estimation and 3D Object Detection

Julie Chang, Gordon Wetzstein

2019-04-18 | ICCV 2019 10 | Scene Understanding, Depth Estimation, 3D Object Detection, Object Detection, Monocular Depth Estimation

Abstract

Depth estimation and 3D object detection are critical for scene understanding but remain challenging to perform with a single image due to the loss of 3D information during image capture. Recent models using deep neural networks have improved monocular depth estimation performance, but there is still difficulty in predicting absolute depth and generalizing outside a standard dataset. Here we introduce the paradigm of deep optics, i.e. end-to-end design of optics and image processing, to the monocular depth estimation problem, using coded defocus blur as an additional depth cue to be decoded by a neural network. We evaluate several optical coding strategies along with an end-to-end optimization scheme for depth estimation on three datasets, including NYU Depth v2 and KITTI. We find an optimized freeform lens design yields the best results, but chromatic aberration from a singlet lens offers significantly improved performance as well. We build a physical prototype and validate that chromatic aberrations improve depth estimation on real-world results. In addition, we train object detection networks on the KITTI dataset and show that the lens optimized for depth estimation also results in improved 3D object detection performance.
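To illustrate why defocus blur carries depth information, and why a chromatic cue helps, here is a minimal sketch based on the standard geometric thin-lens circle-of-confusion formula. It is not necessarily the image formation model used in the paper. The lens parameters, the per-channel focus distances, and the names blur_diameter and blur_signature are hypothetical values chosen for illustration. Chromatic aberration is approximated here simply as a different in-focus distance per color channel.

```python
def blur_diameter(depth_m, focus_m, focal_length_m, aperture_m):
    """Thin-lens circle-of-confusion diameter on the sensor, in meters.

    depth_m: distance of the scene point from the lens
    focus_m: distance at which the lens is in focus
    """
    if depth_m <= focal_length_m or focus_m <= focal_length_m:
        raise ValueError("distances must exceed the focal length")
    return (aperture_m * focal_length_m * abs(depth_m - focus_m)
            / (depth_m * (focus_m - focal_length_m)))


# Hypothetical lens: 50 mm focal length, 25 mm aperture diameter.
FOCAL_LENGTH_M = 0.05
APERTURE_M = 0.025

# Assumption: chromatic aberration is modeled as a per-channel focus distance.
CHANNEL_FOCUS_M = {"red": 1.2, "green": 1.0, "blue": 0.8}


def blur_signature(depth_m):
    """Blur diameter in each color channel for a point at depth_m."""
    return {
        channel: blur_diameter(depth_m, focus, FOCAL_LENGTH_M, APERTURE_M)
        for channel, focus in CHANNEL_FOCUS_M.items()
    }


# Two depths on opposite sides of the green focal plane.
near = blur_signature(2.0 / 3.0)
far = blur_signature(2.0)

for channel in CHANNEL_FOCUS_M:
    print(channel, near[channel], far[channel])
```

Under these assumptions, the green channel alone cannot tell the two depths apart because the blur size is the same on either side of its focal plane, while the red and blue channels blur differently at the two depths. This kind of ambiguity is what a coded or chromatic blur can help resolve.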

Results

Task	Dataset	Metric	Value	Model
Depth Estimation	NYU-Depth V2	RMS	0.4325	Optimized, freeform
Depth Estimation	NYU-Depth V2	RMS	0.433	Freeform
3D	NYU-Depth V2	RMS	0.4325	Optimized, freeform
3D	NYU-Depth V2	RMS	0.433	Freeform
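
As a reading aid for the RMS column, the sketch below assumes it denotes the root-mean-square error between predicted and ground-truth depth. The function name rms_error and the convention of skipping pixels with non-positive ground truth (a common way to mark missing depth) are assumptions for illustration, not details confirmed by this page.

```python
import math


def rms_error(predicted, ground_truth):
    """Root-mean-square error over pixels with valid ground-truth depth.

    Assumption: ground-truth values <= 0 mark missing depth and are skipped.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("inputs must have the same length")
    squared = [
        (p - g) ** 2
        for p, g in zip(predicted, ground_truth)
        if g > 0
    ]
    if not squared:
        raise ValueError("no valid ground-truth pixels")
    return math.sqrt(sum(squared) / len(squared))


# Toy depth values, not data from the paper.
print(rms_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Lower values indicate predictions closer to the ground truth.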
