Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Kaixuan Wang, Hao Chen, Gang Yu, Chunhua Shen, Shaojie Shen
We introduce Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image, which is crucial for metric 3D recovery. While depth and normal are geometrically related and highly complementary, they present distinct challenges. State-of-the-art (SoTA) monocular depth methods achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. Meanwhile, SoTA normal estimation methods have limited zero-shot performance due to the lack of large-scale labeled data. To tackle these issues, we propose solutions for both metric depth estimation and surface normal estimation. For metric depth estimation, we show that the key to a zero-shot single-view model lies in resolving the metric ambiguity from various camera models and in large-scale data training. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. For surface normal estimation, we propose a joint depth-normal optimization module to distill diverse data knowledge from metric depth, enabling normal estimators to learn beyond normal labels. Equipped with these modules, our depth-normal models can be stably trained with over 16 million images from thousands of camera models with different types of annotations, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our project page is at https://JUGGHM.github.io/Metric3Dv2.
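
The metric ambiguity mentioned in the abstract can be illustrated with pinhole geometry: an object of size S at depth d spans f·S/d pixels, so the same image can come from a near object seen with a short focal length or a far object seen with a long one. The following is a minimal sketch of one way a canonical camera space transformation could work, assuming depth labels are rescaled by the ratio of a fixed canonical focal length to the real focal length. The abstract does not spell out the module's details, so the function names, the `CANONICAL_FOCAL` value, and the label-scaling formulation are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Hypothetical canonical focal length in pixels (an assumption for this sketch).
CANONICAL_FOCAL = 1000.0


def to_canonical_depth(depth_m, focal_px, canonical_focal=CANONICAL_FOCAL):
    """Map metric depth (meters) into the canonical camera space.

    Under a pinhole model, an image taken with focal length `focal_px` of a
    scene at depth d looks the same as an image taken with `canonical_focal`
    of a scene at depth d * canonical_focal / focal_px.
    """
    depth_m = np.asarray(depth_m, dtype=np.float64)
    return depth_m * (canonical_focal / focal_px)


def from_canonical_depth(canonical_depth, focal_px, canonical_focal=CANONICAL_FOCAL):
    """Invert the transformation to recover metric depth for the real camera."""
    canonical_depth = np.asarray(canonical_depth, dtype=np.float64)
    return canonical_depth * (focal_px / canonical_focal)


if __name__ == "__main__":
    # Two cameras see an object at the same pixel size:
    # focal 500 px at 2 m, and focal 1000 px at 4 m.
    print(to_canonical_depth(2.0, 500.0))   # canonical depth for camera A
    print(to_canonical_depth(4.0, 1000.0))  # canonical depth for camera B
```

In this sketch the two ambiguous cases map to the same canonical depth, so a network trained on canonical labels sees one consistent target per image appearance. At inference, the prediction would be mapped back with `from_canonical_depth` using the real focal length.
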
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Depth Estimation | NYU-Depth V2 | Delta < 1.25 | 0.989 | Metric3Dv2(L, FT) |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^2 | 0.998 | Metric3Dv2(L, FT) |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^3 | 1 | Metric3Dv2(L, FT) |
| Depth Estimation | NYU-Depth V2 | RMSE | 0.183 | Metric3Dv2(L, FT) |
| Depth Estimation | NYU-Depth V2 | absolute relative error | 0.047 | Metric3Dv2(L, FT) |
| Depth Estimation | NYU-Depth V2 | log 10 | 0.02 | Metric3Dv2(L, FT) |
| Depth Estimation | IBims-1 | Delta < 1.25 | 0.969 | Metric3Dv2(L, ZS) |
| Depth Estimation | KITTI Eigen split | Delta < 1.25 | 0.989 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^2 | 0.998 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^3 | 1 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| Depth Estimation | KITTI Eigen split | RMSE | 1.766 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| Depth Estimation | KITTI Eigen split | RMSE log | 0.06 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| Depth Estimation | KITTI Eigen split | absolute relative error | 0.039 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| 3D | NYU-Depth V2 | Delta < 1.25 | 0.989 | Metric3Dv2(L, FT) |
| 3D | NYU-Depth V2 | Delta < 1.25^2 | 0.998 | Metric3Dv2(L, FT) |
| 3D | NYU-Depth V2 | Delta < 1.25^3 | 1 | Metric3Dv2(L, FT) |
| 3D | NYU-Depth V2 | RMSE | 0.183 | Metric3Dv2(L, FT) |
| 3D | NYU-Depth V2 | absolute relative error | 0.047 | Metric3Dv2(L, FT) |
| 3D | NYU-Depth V2 | log 10 | 0.02 | Metric3Dv2(L, FT) |
| 3D | IBims-1 | Delta < 1.25 | 0.969 | Metric3Dv2(L, ZS) |
| 3D | KITTI Eigen split | Delta < 1.25 | 0.989 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| 3D | KITTI Eigen split | Delta < 1.25^2 | 0.998 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| 3D | KITTI Eigen split | Delta < 1.25^3 | 1 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| 3D | KITTI Eigen split | RMSE | 1.766 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| 3D | KITTI Eigen split | RMSE log | 0.06 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| 3D | KITTI Eigen split | absolute relative error | 0.039 | Metric3Dv2 (g2, FT, 80m, flip_aug_test) |
| Surface Normals Estimation | IBims-1 | % < 11.25 | 69.7 | Metric3Dv2(g2, ZS) |
| Surface Normals Estimation | IBims-1 | % < 22.5 | 76.2 | Metric3Dv2(g2, ZS) |
| Surface Normals Estimation | IBims-1 | % < 30 | 78.8 | Metric3Dv2(g2, ZS) |
| Surface Normals Estimation | IBims-1 | Mean Angle Error | 19.6 | Metric3Dv2(g2, ZS) |
| Surface Normals Estimation | ScanNetV2 | % < 11.25 | 77.8 | Metric3Dv2 (g2, In-domain) |
| Surface Normals Estimation | ScanNetV2 | % < 22.5 | 90.1 | Metric3Dv2 (g2, In-domain) |
| Surface Normals Estimation | ScanNetV2 | % < 30 | 93.5 | Metric3Dv2 (g2, In-domain) |
| Surface Normals Estimation | ScanNetV2 | Mean Angle Error | 9.2 | Metric3Dv2 (g2, In-domain) |
| Surface Normals Estimation | NYU-Depth V2 | % < 11.25 | 68.8 | Metric3Dv2(L, FT) |
| Surface Normals Estimation | NYU-Depth V2 | % < 22.5 | 84.9 | Metric3Dv2(L, FT) |
| Surface Normals Estimation | NYU-Depth V2 | % < 30 | 89.8 | Metric3Dv2(L, FT) |
| Surface Normals Estimation | NYU-Depth V2 | Mean Angle Error | 12 | Metric3Dv2(L, FT) |
| Surface Normals Estimation | NYU-Depth V2 | RMSE | 19.2 | Metric3Dv2(L, FT) |
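
The metric names in the table follow definitions that are standard in the depth and normal estimation literature. The sketch below shows how they are commonly computed with NumPy. The function names and dictionary keys are illustrative and not taken from the Metric3D v2 codebase, and benchmark-specific details (for example depth caps such as the 80m noted in the KITTI rows, evaluation crops, or flip augmentation) are not reproduced. It assumes predicted depths are strictly positive and that pixels with ground-truth depth of zero are invalid.

```python
import numpy as np


def depth_metrics(pred, gt):
    """Common depth metrics over pixels with valid (positive) ground truth."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()
    valid = gt > 0  # assumption: zero marks missing ground truth
    pred, gt = pred[valid], gt[valid]

    # Threshold accuracy: fraction of pixels with max(pred/gt, gt/pred) < 1.25^k.
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
        "rmse": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "log10": float(np.mean(np.abs(np.log10(pred) - np.log10(gt)))),
    }


def normal_metrics(pred, gt):
    """Angular error statistics (degrees) between predicted and true normals."""
    pred = np.asarray(pred, dtype=np.float64).reshape(-1, 3)
    gt = np.asarray(gt, dtype=np.float64).reshape(-1, 3)

    # Normalize to unit length so the dot product is the cosine of the angle.
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    angle = np.degrees(np.arccos(cos))

    return {
        "mean": float(np.mean(angle)),
        "rmse": float(np.sqrt(np.mean(angle ** 2))),
        "pct_11.25": float(100.0 * np.mean(angle < 11.25)),
        "pct_22.5": float(100.0 * np.mean(angle < 22.5)),
        "pct_30": float(100.0 * np.mean(angle < 30.0)),
    }


if __name__ == "__main__":
    print(depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0]))
    print(normal_metrics([[0, 0, 1], [1, 0, 0]], [[0, 0, 1], [0, 0, 1]]))
```

Under these definitions, higher is better for the threshold accuracies (Delta and % rows) and lower is better for the error rows (RMSE, absolute relative error, log 10, mean angle error).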