Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Metric3Dv2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation

Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Kaixuan Wang, Hao Chen, Gang Yu, Chunhua Shen, Shaojie Shen

2024-03-22 · Under review (Transactions, 2024) · Tasks: Zero-shot Generalization, Surface Normal Estimation, Depth Estimation
Paper · PDF · Code (official)

Abstract

We introduce Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image, which is crucial for metric 3D recovery. While depth and normal are geometrically related and highly complementary, they present distinct challenges. SoTA monocular depth methods achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. Meanwhile, SoTA normal estimation methods have limited zero-shot performance due to the lack of large-scale labeled data. To tackle these issues, we propose solutions for both metric depth estimation and surface normal estimation. For metric depth estimation, we show that the key to a zero-shot single-view model lies in resolving the metric ambiguity from various camera models and large-scale data training. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. For surface normal estimation, we propose a joint depth-normal optimization module to distill diverse data knowledge from metric depth, enabling normal estimators to learn beyond normal labels. Equipped with these modules, our depth-normal models can be stably trained with over 16 million images from thousands of camera models with annotations of different types, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our project page is at https://JUGGHM.github.io/Metric3Dv2.
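The canonical camera space transformation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the canonical focal length value and the function names are assumptions, and the sketch shows only the label-scaling variant (rescaling metric depth by the focal-length ratio so that training data from different cameras becomes metrically consistent).

```python
import numpy as np

# Assumed canonical focal length in pixels (illustrative value, not from the paper)
CANONICAL_FOCAL = 1000.0

def depth_to_canonical(depth_m, focal_px):
    """Map metric depth labels into the canonical camera space.

    Scaling depth by f_c / f makes depths captured with different focal
    lengths consistent, which resolves the metric ambiguity at training time.
    """
    scale = CANONICAL_FOCAL / focal_px
    return depth_m * scale, scale

def depth_from_canonical(pred_canonical, focal_px):
    """Invert the transform at inference to recover metric depth
    for the actual camera that took the image."""
    return pred_canonical * focal_px / CANONICAL_FOCAL

# Round trip: a 5 m depth seen by a camera with a 500 px focal length
canonical_depth, scale = depth_to_canonical(np.array([5.0]), 500.0)
metric_depth = depth_from_canonical(canonical_depth, 500.0)
```

The round trip is exact: converting a depth into canonical space and back recovers the original metric value, which is why the module can be plugged into an existing monocular model without changing its output head.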

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Depth Estimation / 3D | NYU-Depth V2 | Delta < 1.25 | 0.989 | Metric3Dv2 (L, FT) |
| Depth Estimation / 3D | NYU-Depth V2 | Delta < 1.25^2 | 0.998 | Metric3Dv2 (L, FT) |
| Depth Estimation / 3D | NYU-Depth V2 | Delta < 1.25^3 | 1 | Metric3Dv2 (L, FT) |
| Depth Estimation / 3D | NYU-Depth V2 | RMSE | 0.183 | Metric3Dv2 (L, FT) |
| Depth Estimation / 3D | NYU-Depth V2 | absolute relative error | 0.047 | Metric3Dv2 (L, FT) |
| Depth Estimation / 3D | NYU-Depth V2 | log 10 | 0.02 | Metric3Dv2 (L, FT) |
| Depth Estimation / 3D | IBims-1 | Delta < 1.25 | 0.969 | Metric3Dv2 (L, ZS) |
| Depth Estimation / 3D | KITTI Eigen split | Delta < 1.25 | 0.989 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| Depth Estimation / 3D | KITTI Eigen split | Delta < 1.25^2 | 0.998 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| Depth Estimation / 3D | KITTI Eigen split | Delta < 1.25^3 | 1 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| Depth Estimation / 3D | KITTI Eigen split | RMSE | 1.766 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| Depth Estimation / 3D | KITTI Eigen split | RMSE log | 0.06 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| Depth Estimation / 3D | KITTI Eigen split | absolute relative error | 0.039 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| Surface Normals Estimation | IBims-1 | % < 11.25° | 69.7 | Metric3Dv2 (g2, ZS) |
| Surface Normals Estimation | IBims-1 | % < 22.5° | 76.2 | Metric3Dv2 (g2, ZS) |
| Surface Normals Estimation | IBims-1 | % < 30° | 78.8 | Metric3Dv2 (g2, ZS) |
| Surface Normals Estimation | IBims-1 | Mean Angle Error | 19.6 | Metric3Dv2 (g2, ZS) |
| Surface Normals Estimation | ScanNetV2 | % < 11.25° | 77.8 | Metric3Dv2 (g2, In-domain) |
| Surface Normals Estimation | ScanNetV2 | % < 22.5° | 90.1 | Metric3Dv2 (g2, In-domain) |
| Surface Normals Estimation | ScanNetV2 | % < 30° | 93.5 | Metric3Dv2 (g2, In-domain) |
| Surface Normals Estimation | ScanNetV2 | Mean Angle Error | 9.2 | Metric3Dv2 (g2, In-domain) |
| Surface Normals Estimation | NYU Depth v2 | % < 11.25° | 68.8 | Metric3Dv2 (L, FT) |
| Surface Normals Estimation | NYU Depth v2 | % < 22.5° | 84.9 | Metric3Dv2 (L, FT) |
| Surface Normals Estimation | NYU Depth v2 | % < 30° | 89.8 | Metric3Dv2 (L, FT) |
| Surface Normals Estimation | NYU Depth v2 | Mean Angle Error | 12 | Metric3Dv2 (L, FT) |
| Surface Normals Estimation | NYU Depth v2 | RMSE | 19.2 | Metric3Dv2 (L, FT) |
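The metrics in the table are standard in depth and normal evaluation. A minimal sketch of how they are typically computed is below; this is not the benchmark's evaluation code, and the function names are my own. Depth metrics use the per-pixel ratio between prediction and ground truth (the delta thresholds) plus error norms; normal metrics use the angular error between unit normal vectors, with the assumption that inputs are already normalized and restricted to valid pixels.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics: delta thresholds, absolute relative
    error, RMSE, RMSE in log space, and mean log10 error (valid pixels only)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    ratio = np.maximum(pred / gt, gt / pred)  # per-pixel max ratio
    return {
        "delta<1.25":   float(np.mean(ratio < 1.25)),
        "delta<1.25^2": float(np.mean(ratio < 1.25 ** 2)),
        "delta<1.25^3": float(np.mean(ratio < 1.25 ** 3)),
        "abs_rel":  float(np.mean(np.abs(pred - gt) / gt)),
        "rmse":     float(np.sqrt(np.mean((pred - gt) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "log10":    float(np.mean(np.abs(np.log10(pred) - np.log10(gt)))),
    }

def normal_metrics(pred_n, gt_n):
    """Angular-error metrics for surface normals: mean angle error (degrees)
    and the percentage of pixels within 11.25/22.5/30 degrees.
    Assumes unit-length normal vectors in the last axis."""
    cos = np.clip(np.sum(pred_n * gt_n, axis=-1), -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))  # per-pixel angular error
    return {
        "mean_angle_error": float(np.mean(ang)),
        "%<11.25": float(np.mean(ang < 11.25) * 100),
        "%<22.5":  float(np.mean(ang < 22.5) * 100),
        "%<30":    float(np.mean(ang < 30.0) * 100),
    }
```

For example, a perfect prediction gives delta < 1.25 of 1.0 and zero error on every other depth metric, which is why the delta columns in the table saturate near 1 while RMSE and AbsRel remain the discriminative numbers.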

Related Papers

$S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
Towards Depth Foundation Model: Recent Trends in Vision-Based Depth Estimation (2025-07-15)
MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network (2025-07-15)
Cameras as Relative Positional Encoding (2025-07-14)