Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage

Denis Zavadski, Damjan Kalšan, Carsten Rother

2024-09-13 · Zero-shot Generalization · Scene Understanding · Depth Estimation · Monocular Depth Estimation

Paper · PDF · Code (official)

Abstract

This work addresses the task of zero-shot monocular depth estimation. A recent advance in this field has been the idea of utilising Text-to-Image foundation models, such as Stable Diffusion. Foundation models provide a rich and generic image representation, and therefore little training data is required to reformulate them as a depth estimation model that predicts highly detailed depth maps and has good generalisation capabilities. However, the realisation of this idea has so far led to approaches that are, unfortunately, highly inefficient at test time due to the underlying iterative denoising process. In this work, we propose a different realisation of this idea and present PrimeDepth, a method that is highly efficient at test time while keeping, or even enhancing, the positive aspects of diffusion-based approaches. Our key idea is to extract from Stable Diffusion a rich, but frozen, image representation by running a single denoising step. This representation, which we term the preimage, is then fed into a refiner network with an architectural inductive bias before entering the downstream task. We validate experimentally that PrimeDepth is two orders of magnitude faster than the leading diffusion-based method, Marigold, while being more robust in challenging scenarios and marginally superior quantitatively. Thereby, we reduce the gap to the currently leading data-driven approach, Depth Anything, which is still quantitatively superior, but predicts less detailed depth maps and requires 20 times more labelled data. Due to the complementary nature of our approach, even a simple averaging between PrimeDepth and Depth Anything predictions can improve upon both methods and sets a new state-of-the-art in zero-shot monocular depth estimation. In the future, data-driven approaches may also benefit from integrating our preimage.
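The "simple averaging" of PrimeDepth and Depth Anything predictions mentioned in the abstract can be sketched as follows. The abstract does not specify the ensembling protocol, so the per-image least-squares scale-and-shift alignment below is an assumption (it is the standard protocol for comparing affine-invariant depth maps), and the function names are hypothetical:

```python
import numpy as np

def align_scale_shift(pred, ref):
    """Least-squares fit of scale s and shift t so that s*pred + t ~ ref.
    This affine alignment is the usual protocol for affine-invariant
    depth maps; the paper's exact protocol is an assumption here."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    s, t = np.linalg.lstsq(A, ref.ravel(), rcond=None)[0]
    return s * pred + t

def average_predictions(depth_a, depth_b):
    """Average two depth maps after aligning the second onto the
    first's scale and shift (hypothetical ensembling sketch)."""
    return 0.5 * (depth_a + align_scale_shift(depth_b, depth_a))
```

Because monocular predictions are only defined up to an affine transform, averaging without alignment would mix incompatible scales; aligning first makes the mean meaningful.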

Results

| Task             | Dataset           | Metric                  | Value | Model                        |
|------------------|-------------------|-------------------------|-------|------------------------------|
| Depth Estimation | NYU-Depth V2      | Delta < 1.25            | 0.977 | PrimeDepth + Depth Anything  |
| Depth Estimation | NYU-Depth V2      | absolute relative error | 0.046 | PrimeDepth + Depth Anything  |
| Depth Estimation | NYU-Depth V2      | Delta < 1.25            | 0.966 | PrimeDepth                   |
| Depth Estimation | NYU-Depth V2      | absolute relative error | 0.058 | PrimeDepth                   |
| Depth Estimation | ETH3D             | Delta < 1.25            | 0.967 | PrimeDepth                   |
| Depth Estimation | ETH3D             | absolute relative error | 0.068 | PrimeDepth                   |
| Depth Estimation | KITTI Eigen split | Delta < 1.25            | 0.953 | PrimeDepth + Depth Anything  |
| Depth Estimation | KITTI Eigen split | absolute relative error | 0.073 | PrimeDepth + Depth Anything  |
| Depth Estimation | KITTI Eigen split | Delta < 1.25            | 0.937 | PrimeDepth                   |
| Depth Estimation | KITTI Eigen split | absolute relative error | 0.079 | PrimeDepth                   |
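The two metrics reported above are standard in depth-estimation benchmarks: absolute relative error (lower is better) and the threshold accuracy Delta < 1.25 (higher is better, the fraction of pixels whose prediction-to-ground-truth ratio is within a factor of 1.25). A minimal sketch of both, assuming valid positive-depth arrays:

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta_threshold(pred, gt, thr=1.25):
    """Fraction of pixels where max(pred/gt, gt/pred) < thr."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < thr))
```

Benchmark evaluations typically mask out invalid (zero or missing) ground-truth pixels before applying these formulas.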

Related Papers

- Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
- Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- $S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
- SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
- Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
- Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)