Danila Rukhovich, Anna Vorontsova, Anton Konushin
In this paper, we introduce the task of multi-view RGB-based 3D object detection as an end-to-end optimization problem. To address this problem, we propose ImVoxelNet, a novel fully convolutional method for 3D object detection from monocular or multi-view RGB images. The number of monocular images in each multi-view input can vary during both training and inference; in fact, it may differ from one multi-view input to the next. ImVoxelNet handles both indoor and outdoor scenes, which makes it general-purpose. Specifically, it achieves state-of-the-art results in car detection on the KITTI (monocular) and nuScenes (multi-view) benchmarks among all methods that accept RGB images. Moreover, it surpasses existing RGB-based 3D object detection methods on the SUN RGB-D dataset. On ScanNet, ImVoxelNet sets a new benchmark for multi-view 3D object detection. The source code and the trained models are available at https://github.com/saic-vul/imvoxelnet.
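The abstract does not say how a varying number of views is reduced to a single input for the detector. One common way to do this, shown here only as a minimal sketch under stated assumptions, is to average per-view voxel features over the views that actually observe each voxel. The function name `aggregate_views`, the flattened voxel layout, and the visibility masks are all hypothetical and are not taken from the released code.

```python
from typing import List, Sequence


def aggregate_views(
    view_feats: Sequence[Sequence[Sequence[float]]],
    view_masks: Sequence[Sequence[bool]],
) -> List[List[float]]:
    """Fuse per-view voxel features by masked averaging (illustrative only).

    view_feats[i][v] is the feature vector that view i assigns to voxel v.
    view_masks[i][v] is True when voxel v is visible in view i.
    The result has one feature vector per voxel, whatever the number of views.
    """
    n_voxels = len(view_feats[0])
    n_channels = len(view_feats[0][0])
    fused = [[0.0] * n_channels for _ in range(n_voxels)]
    counts = [0] * n_voxels

    # Accumulate features only from views in which the voxel is visible.
    for feats, mask in zip(view_feats, view_masks):
        for v in range(n_voxels):
            if mask[v]:
                counts[v] += 1
                for c in range(n_channels):
                    fused[v][c] += feats[v][c]

    # Divide by the number of contributing views; unseen voxels stay at zero.
    for v in range(n_voxels):
        if counts[v] > 0:
            fused[v] = [x / counts[v] for x in fused[v]]
    return fused


# One view and three views produce outputs of the same size.
single = aggregate_views([[[2.0, 4.0], [1.0, 1.0]]], [[True, False]])
multi = aggregate_views(
    [[[2.0, 4.0], [1.0, 1.0]], [[4.0, 8.0], [3.0, 5.0]], [[9.0, 9.0], [5.0, 7.0]]],
    [[True, False], [True, True], [False, True]],
)
```

Because the output size depends only on the voxel grid, the same downstream network can in principle consume one image or many.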
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | DAIR-V2X-I | AP\|R40 (easy) | 44.8 | ImVoxelNet |
| Object Detection | DAIR-V2X-I | AP\|R40 (hard) | 37.6 | ImVoxelNet |
| Object Detection | DAIR-V2X-I | AP\|R40 (moderate) | 37.6 | ImVoxelNet |
| Object Detection | ScanNetV2 | mAP@0.25 | 48.1 | ImVoxelNet (RGB only) |
| Object Detection | ScanNetV2 | mAP@0.5 | 22.7 | ImVoxelNet (RGB only) |
| Object Detection | SUN RGB-D | AP@0.15 (10 / NYU-37) | 42.69 | ImVoxelNet |
| Object Detection | SUN RGB-D | AP@0.15 (10 / PNet-30) | 48.74 | ImVoxelNet |
| Object Detection | SUN RGB-D | AP@0.15 (NYU-37) | 21.08 | ImVoxelNet |
| Room Layout Estimation | SUN RGB-D | Camera Pitch | 2.63 | ImVoxelNet |
| Room Layout Estimation | SUN RGB-D | Camera Roll | 1.96 | ImVoxelNet |
| Room Layout Estimation | SUN RGB-D | IoU | 59.3 | ImVoxelNet |
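
The thresholds in metric names such as mAP@0.25, mAP@0.5, and AP@0.15 are 3D intersection-over-union (IoU) thresholds: a predicted box counts as a match only if its IoU with a ground-truth box reaches the threshold. The following is a minimal sketch of that overlap test, assuming axis-aligned boxes given as `(x_min, y_min, z_min, x_max, y_max, z_max)`. The function name is hypothetical, and oriented boxes (as used on SUN RGB-D) would need a more general routine.

```python
from typing import Sequence


def iou_3d_axis_aligned(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two axis-aligned 3D boxes (x_min, y_min, z_min, x_max, y_max, z_max)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis, so no overlap at all
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


# Two 2x2x2 boxes offset by one unit along x: intersection 4, union 12.
iou = iou_3d_axis_aligned((0, 0, 0, 2, 2, 2), (1, 0, 0, 3, 2, 2))
matches_at_025 = iou >= 0.25
matches_at_05 = iou >= 0.5
```

In this example the same prediction would be accepted at the 0.25 threshold and rejected at 0.5, which is why scores at the stricter threshold are lower.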