Steven Hickson, Karthik Raveendran, Alireza Fathi, Kevin Murphy, Irfan Essa
We propose four insights that significantly improve the performance of deep learning models that predict surface normals and semantic labels from a single RGB image. These insights are: (1) denoise the "ground truth" surface normals in the training set to ensure consistency with the semantic labels; (2) train concurrently on a mix of real and synthetic data, instead of pretraining on synthetic data and finetuning on real data; (3) jointly predict normals and semantics using a shared model, but only backpropagate errors on pixels that have valid training labels; (4) slim down the model and use grayscale instead of color inputs. Despite the simplicity of these steps, we demonstrate consistently improved results on several datasets, using a model that runs at 12 fps on a standard mobile phone.
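Insight (3) amounts to masking the per-pixel loss so that unlabeled pixels contribute no error and therefore no gradient. The sketch below illustrates the idea for surface normals in plain Python. The loss form (one minus cosine similarity), the function name `masked_cosine_loss`, and the list-based data layout are assumptions made for illustration; the abstract does not specify the loss actually used.

```python
import math

def masked_cosine_loss(pred_normals, gt_normals, valid_mask):
    """Mean of (1 - cosine similarity) over pixels whose mask entry is True.

    pred_normals, gt_normals: lists of (x, y, z) tuples, one per pixel.
    valid_mask: list of bools; False marks pixels without a usable label.
    The loss form is an assumption, not the paper's stated objective.
    """
    total, count = 0.0, 0
    for p, g, valid in zip(pred_normals, gt_normals, valid_mask):
        if not valid:
            continue  # masked pixel: adds nothing to the loss
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        total += 1.0 - dot / max(norm, 1e-12)  # small floor avoids division by zero
        count += 1
    # An image with no valid pixels yields zero loss rather than an error.
    return total / count if count else 0.0


pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
gt = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)]
mask = [True, True, False]  # the third pixel has no valid label
loss = masked_cosine_loss(pred, gt, mask)
```

In an automatic-differentiation framework the same effect is usually obtained by multiplying the per-pixel loss by the mask before averaging, which zeroes the gradient at masked pixels.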
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | ScanNetV2 | Pixel Accuracy | 65.6 | Floors are Flat |
| Surface Normals Estimation | ScanNetV2 | % < 11.25° | 50.9 | Floors are Flat |
| Surface Normals Estimation | ScanNetV2 | % < 22.5° | 65.2 | Floors are Flat |
| Surface Normals Estimation | ScanNetV2 | % < 30° | 70 | Floors are Flat |
| Surface Normals Estimation | ScanNetV2 | Mean Angle Error (°) | 28 | Floors are Flat |
| Surface Normals Estimation | NYU Depth v2 | % < 11.25° | 59.5 | Floors are Flat |
| Surface Normals Estimation | NYU Depth v2 | % < 22.5° | 72.2 | Floors are Flat |
| Surface Normals Estimation | NYU Depth v2 | % < 30° | 77.3 | Floors are Flat |
| Surface Normals Estimation | NYU Depth v2 | Mean Angle Error (°) | 19.7 | Floors are Flat |
| Surface Normals Estimation | NYU Depth v2 | RMSE (°) | 19.3 | Floors are Flat |
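The surface normal metrics in the table are angular-error statistics. As a rough illustration, they can be computed as in the sketch below, which assumes the usual definitions: per-pixel angle between predicted and ground-truth normals in degrees, the percentage of pixels under each threshold, the mean, and the root mean square. The function names are hypothetical, and the exact evaluation protocol behind the reported numbers is not described here.

```python
import math

def angular_errors_deg(pred_normals, gt_normals):
    """Angle in degrees between each predicted and ground-truth normal."""
    errors = []
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
        cos = max(-1.0, min(1.0, dot / norm))
        errors.append(math.degrees(math.acos(cos)))
    return errors

def summarize(errors, thresholds=(11.25, 22.5, 30.0)):
    """Mean, RMSE, and percentage of pixels with error below each threshold."""
    n = len(errors)
    summary = {
        "mean": sum(errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }
    for t in thresholds:
        summary[f"pct_below_{t}"] = 100.0 * sum(e < t for e in errors) / n
    return summary


pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]
gt = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
errors = angular_errors_deg(pred, gt)
stats = summarize(errors)
```

Pixels without valid ground truth would need to be filtered out before calling these functions; that step is omitted here.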