Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, Zeming Li
We propose BEVDepth, a 3D object detector with trustworthy depth estimation for camera-based Bird's-Eye-View (BEV) 3D object detection. Our work is based on a key observation: depth estimation in recent approaches is surprisingly inadequate, even though depth is essential to camera-based 3D detection. BEVDepth resolves this by leveraging explicit depth supervision. We also introduce a camera-awareness depth estimation module to improve depth prediction, and we design a novel Depth Refinement Module to counter the side effects of imprecise feature unprojection. Aided by a customized Efficient Voxel Pooling operation and a multi-frame mechanism, BEVDepth achieves a new state-of-the-art 60.9% NDS on the challenging nuScenes test set while maintaining high efficiency. This is the first time a camera-based model has reached 60% NDS.
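To make "explicit depth supervision" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the detector predicts a categorical distribution over discretized depth bins for each pixel, and that a sparse ground-truth depth is available for some pixels (for example from projected LiDAR points, which is an assumption here). The loss form (cross-entropy), the bin range, and the names `depth_to_bin` and `depth_loss` are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def depth_to_bin(depth_m, d_min=2.0, d_max=58.0, bin_size=0.5):
    """Map a metric depth to a bin index; return None if it is out of range.

    The range and bin size are illustrative assumptions, not values from the paper.
    """
    if depth_m < d_min or depth_m >= d_max:
        return None
    return int((depth_m - d_min) / bin_size)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def depth_loss(pixel_logits, gt_depths, **bin_kwargs):
    """Mean cross-entropy over pixels that have a valid ground-truth depth.

    pixel_logits: one list of depth-bin logits per pixel.
    gt_depths: one metric depth per pixel, or None where no measurement exists.
    """
    losses = []
    for logits, depth in zip(pixel_logits, gt_depths):
        if depth is None:
            continue  # no supervision signal for this pixel
        idx = depth_to_bin(depth, **bin_kwargs)
        if idx is None or idx >= len(logits):
            continue  # depth falls outside the predicted bins
        probs = softmax(logits)
        losses.append(-math.log(probs[idx] + 1e-12))
    return sum(losses) / len(losses) if losses else 0.0
```

Pixels without a depth measurement are skipped, so the loss is driven only by locations where supervision exists. In a full detector this term would be added to the detection loss with some weighting, which is left out of the sketch.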
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | nuScenes Camera Only | NDS | 60.9 | BEVDepth-pure |
| Object Detection | Rope3D | AP@0.7 | 42.56 | BEVDepth |
| Object Detection | DAIR-V2X-I | AP\|R40(easy) | 75.7 | BEVDepth |
| Object Detection | DAIR-V2X-I | AP\|R40(hard) | 63.7 | BEVDepth |
| Object Detection | DAIR-V2X-I | AP\|R40(moderate) | 63.6 | BEVDepth |
| 3D Object Detection | nuScenes Camera Only | NDS | 60.9 | BEVDepth-pure |
| 3D Object Detection | Rope3D | AP@0.7 | 42.56 | BEVDepth |
| 3D Object Detection | DAIR-V2X-I | AP\|R40(easy) | 75.7 | BEVDepth |
| 3D Object Detection | DAIR-V2X-I | AP\|R40(hard) | 63.7 | BEVDepth |
| 3D Object Detection | DAIR-V2X-I | AP\|R40(moderate) | 63.6 | BEVDepth |