Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, Adrien Gaidon
Recent progress in 3D object detection from single images leverages monocular depth estimation as a way to produce 3D point clouds, turning cameras into pseudo-lidar sensors. These two-stage detectors improve with the accuracy of the intermediate depth estimation network, which can itself be improved without manual labels via large-scale self-supervised learning. However, they tend to suffer from overfitting more than end-to-end methods, are more complex, and the gap with similar lidar-based detectors remains significant. In this work, we propose an end-to-end, single-stage, monocular 3D object detector, DD3D, that can benefit from depth pre-training like pseudo-lidar methods, but without their limitations. Our architecture is designed for effective information transfer between depth estimation and 3D detection, allowing us to scale with the amount of unlabeled pre-training data. Our method achieves state-of-the-art results on two challenging benchmarks, with 16.34% and 9.28% AP for Cars and Pedestrians (respectively) on the KITTI-3D benchmark, and 41.5% mAP on NuScenes.
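For readers unfamiliar with the pseudo-lidar step mentioned above, the following is a minimal sketch of how a predicted depth map can be back-projected into a 3D point cloud under a standard pinhole camera model. It is not DD3D's method (DD3D is designed to avoid this intermediate step) and it is not taken from the authors' code. The function name `unproject_depth` and the parameters `fx`, `fy`, `cx`, `cy` (focal lengths and principal point, in pixels) are illustrative assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject_depth(
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a dense depth map into camera-frame 3D points.

    Hypothetical helper for illustration only. Assumes a pinhole camera:
    pixel (u, v) with depth z maps to
        x = (u - cx) * z / fx,  y = (v - cy) * z / fy,  z = z.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):      # v indexes image rows
        for u, z in enumerate(row):      # u indexes image columns
            if z <= 0.0:                 # skip pixels with no valid depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map (metres); one pixel has no valid depth.
toy_depth = [[10.0, 10.0], [0.0, 20.0]]
cloud = unproject_depth(toy_depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

In a two-stage pseudo-lidar pipeline, a point cloud like `cloud` would then be passed to a lidar-style 3D detector, which is the second stage the abstract contrasts with end-to-end detection.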
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| 3D Object Detection | KITTI Cars Easy | AP Easy | 23.22 | DD3D |
| 3D Object Detection | KITTI Cars Moderate | AP Medium | 16.34 | DD3D |
| 3D Object Detection | KITTI Cars Hard | AP Hard | 14.2 | DD3D |
| 3D Object Detection | KITTI Pedestrian Easy | AP Easy | 13.91 | DD3D |
| 3D Object Detection | KITTI Pedestrian Moderate | AP Medium | 9.3 | DD3D |
| 3D Object Detection | KITTI Pedestrian Hard | AP Hard | 8.05 | DD3D |