Yin Zhou, Oncel Tuzel
Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need for manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms the group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR-based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.
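The voxel partitioning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `voxelize` and `centroid_feature`, the 0.5-unit voxel size, and the sample points are all hypothetical. The per-voxel centroid is only a hand-written stand-in for the learned VFE layer, which the abstract names but does not specify.

```python
from collections import defaultdict
from math import floor


def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Group 3D points into equally spaced voxels.

    points: iterable of (x, y, z) tuples.
    voxel_size: (vx, vy, vz) edge lengths; illustrative values only.
    Returns a dict mapping an integer voxel index (i, j, k) to its points.
    """
    voxels = defaultdict(list)
    for p in points:
        # Integer grid coordinates of the voxel that contains this point.
        idx = tuple(floor((c - o) / s) for c, o, s in zip(p, origin, voxel_size))
        voxels[idx].append(p)
    return dict(voxels)


def centroid_feature(group):
    """Placeholder per-voxel feature: the mean of the points in the voxel.

    Assumption: this replaces the learned VFE layer purely for illustration.
    """
    n = len(group)
    return tuple(sum(p[d] for p in group) / n for d in range(3))


# Hypothetical toy point cloud.
points = [(0.1, 0.2, 0.3), (0.3, 0.1, 0.2), (1.2, 0.1, 0.0), (-0.1, 0.0, 0.0)]
voxels = voxelize(points, voxel_size=(0.5, 0.5, 0.5))
features = {idx: centroid_feature(group) for idx, group in voxels.items()}
```

Only non-empty voxels appear in the dictionary, which suits a sparse LiDAR point cloud where most of the grid holds no points.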
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Birds Eye View Object Detection | KITTI Cars Easy val | AP | 89.6 | VoxelNet |
| Birds Eye View Object Detection | KITTI Cars Moderate val | AP | 84.81 | VoxelNet |
| Birds Eye View Object Detection | KITTI Cars Hard val | AP | 78.57 | VoxelNet |
| Birds Eye View Object Detection | KITTI Cars Hard | AP | 77.39 | VoxelNet |
| Birds Eye View Object Detection | KITTI Pedestrian Easy val | AP | 65.95 | VoxelNet |
| Birds Eye View Object Detection | KITTI Pedestrian Moderate val | AP | 61.05 | VoxelNet |
| Birds Eye View Object Detection | KITTI Pedestrian Hard val | AP | 56.98 | VoxelNet |
| Birds Eye View Object Detection | KITTI Cyclist Easy val | AP | 74.41 | VoxelNet |
| Birds Eye View Object Detection | KITTI Cyclist Moderate val | AP | 52.18 | VoxelNet |
| Birds Eye View Object Detection | KITTI Cyclist Hard val | AP | 50.49 | VoxelNet |