Cheng Sun, Min Sun, Hwann-Tzong Chen
We present HoHoNet, a versatile and efficient framework for holistic understanding of an indoor 360-degree panorama using a Latent Horizontal Feature (LHFeat). The compact LHFeat flattens the features along the vertical direction and has shown success in modeling per-column modality for room layout reconstruction. HoHoNet advances in two important aspects. First, the deep architecture is redesigned to run faster with improved accuracy. Second, we propose a novel horizon-to-dense module, which relaxes the per-column output shape constraint and allows per-pixel dense prediction from LHFeat. HoHoNet is fast: it runs at 52 FPS and 110 FPS with ResNet-50 and ResNet-34 backbones, respectively, when modeling dense modalities from a high-resolution $512 \times 1024$ panorama. HoHoNet is also accurate. On layout estimation and semantic segmentation, HoHoNet achieves results on par with the current state of the art. On dense depth estimation, HoHoNet outperforms all prior art by a large margin.
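
The two ideas named in the abstract, flattening features along the vertical direction and then expanding each column back to a per-pixel output, can be illustrated with a toy NumPy sketch. This is not the authors' architecture: mean pooling stands in for whatever learned height compression the network uses, and treating each column's channels as coefficients of a cosine basis is only one assumed way to relax the per-column output shape. The function names `flatten_vertical`, `cosine_basis`, and `horizon_to_dense` are hypothetical.

```python
import numpy as np

def flatten_vertical(feat):
    """Collapse the height axis of a (C, H, W) feature map into (C, W).

    Mean pooling is a stand-in (assumption) for a learned compression.
    """
    return feat.mean(axis=1)

def cosine_basis(height, num_coeffs):
    """DCT-II style cosine basis of shape (height, num_coeffs)."""
    rows = (np.arange(height) + 0.5)[:, None]
    freqs = np.arange(num_coeffs)[None, :]
    return np.cos(np.pi * rows * freqs / height)

def horizon_to_dense(lhfeat, height):
    """Expand per-column coefficients (K, W) into a dense (H, W) map.

    Each column's K values are treated as weights on K cosine basis
    functions (an assumed decoding, chosen for illustration).
    """
    basis = cosine_basis(height, lhfeat.shape[0])  # (H, K)
    return basis @ lhfeat                          # (H, W)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 32))      # toy backbone output (C, H, W)
lhfeat = flatten_vertical(feat)              # one feature vector per column
dense = horizon_to_dense(lhfeat, height=64)  # per-pixel prediction
print(lhfeat.shape, dense.shape)             # (8, 32) (64, 32)
```

The point of the sketch is the shape bookkeeping: the horizontal feature has one vector per image column, yet the decoded output has a value for every pixel in that column.
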
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Depth Estimation | Stanford2D3D Panoramic | RMSE | 0.3834 | HoHoNet (ResNet-101) |
| Depth Estimation | Stanford2D3D Panoramic | absolute relative error | 0.1014 | HoHoNet (ResNet-101) |
| 3D Reconstruction | Stanford2D3D Panoramic | 3DIoU | 79.88 | HoHoNet (ResNet-101) |
| Semantic Segmentation | Stanford2D3D Panoramic - RGBD | mAcc | 68.9 | HoHoNet (ResNet-101) |
| Semantic Segmentation | Stanford2D3D Panoramic - RGBD | mIoU | 56.3 | HoHoNet (ResNet-101) |
| Semantic Segmentation | Stanford2D3D Panoramic | mAcc | 65 | HoHoNet (ResNet-101) |
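
For readers unfamiliar with the depth metrics in the table, the following minimal sketch shows how RMSE and absolute relative error are commonly computed from a predicted and a ground-truth depth map. The helper name `depth_metrics` is hypothetical, and masking out pixels with non-positive ground truth is an assumption; the exact evaluation protocol behind the reported numbers is not stated here.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Return (RMSE, absolute relative error) over valid pixels.

    Pixels with gt <= 0 are treated as missing depth (assumption).
    """
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    return rmse, abs_rel

# Toy 2x2 depth maps; the last ground-truth pixel is missing (0.0).
gt = np.array([[1.0, 2.0], [4.0, 0.0]])
pred = np.array([[1.5, 2.0], [3.0, 9.0]])
rmse, abs_rel = depth_metrics(pred, gt)
print(round(float(rmse), 4), round(float(abs_rel), 4))  # 0.6455 0.25
```

Lower is better for both metrics.
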