Haiyang Wang, Chen Shi, Shaoshuai Shi, Meng Lei, Sen Wang, Di He, Bernt Schiele, LiWei Wang
Designing an efficient yet deployment-friendly 3D backbone to handle sparse point clouds is a fundamental problem in 3D perception. Compared with customized sparse convolution, the attention mechanism in Transformers is better suited to flexibly modeling long-range relationships and is easier to deploy in real-world applications. However, because point clouds are sparse, applying a standard Transformer to them is non-trivial. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception. To process sparse points efficiently in parallel, we propose Dynamic Sparse Window Attention, which partitions each window into a series of local regions according to its sparsity and then computes the features of all regions in a fully parallel manner. To allow cross-set connections, we design a rotated set partitioning strategy that alternates between two partitioning configurations in consecutive self-attention layers. To support effective downsampling and better encode geometric information, we also propose an attention-style 3D pooling module on sparse points, which is powerful and deployment-friendly because it uses no customized CUDA operations. Our model achieves state-of-the-art performance on a broad range of 3D perception tasks. More importantly, DSVT can be easily deployed with TensorRT at real-time inference speed (27 Hz). Code will be available at https://github.com/Haiyang-W/DSVT.
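The set partitioning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `partition_window`, the `PAD` index, the 2D coordinates, and the choice of simple consecutive chunking after sorting are all assumptions made for illustration. The sketch sorts the non-empty voxels of one window along one axis, splits them into fixed-size sets (padding the last one so all sets can be batched), and switches the sort axis to mimic alternating between two partitioning configurations.

```python
PAD = -1  # hypothetical padding index marking an empty slot in a set


def partition_window(voxel_coords, set_size, axis_order):
    """Split the non-empty voxels of one window into fixed-size sets.

    voxel_coords: list of (x, y) integer coordinates of non-empty voxels.
    set_size:     number of slots per set (assumed hyperparameter).
    axis_order:   "x" sorts x-major, "y" sorts y-major; alternating the two
                  between layers changes which voxels share a set.
    Returns a list of sets, each a list of indices into voxel_coords,
    padded with PAD so every set has exactly set_size entries.
    """
    if axis_order not in ("x", "y"):
        raise ValueError("axis_order must be 'x' or 'y'")
    if axis_order == "x":
        def sort_key(i):
            return (voxel_coords[i][0], voxel_coords[i][1])
    else:
        def sort_key(i):
            return (voxel_coords[i][1], voxel_coords[i][0])

    order = sorted(range(len(voxel_coords)), key=sort_key)

    sets = []
    for start in range(0, len(order), set_size):
        chunk = order[start:start + set_size]
        chunk += [PAD] * (set_size - len(chunk))  # pad the last set
        sets.append(chunk)
    return sets


# Five non-empty voxels in one window, two slots per set.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)]
print(partition_window(coords, 2, "x"))  # [[0, 1], [2, 3], [4, -1]]
print(partition_window(coords, 2, "y"))  # [[0, 2], [4, 1], [3, -1]]
```

Because the number of sets depends on how many voxels a window actually contains, dense windows produce more sets than sparse ones, while every set keeps the same size and can be stacked into one batch for attention. Voxels 0 and 1 share a set under the x-major ordering but not under the y-major one, which is the kind of cross-set mixing that alternating configurations is meant to provide.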
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | nuScenes LiDAR only | NDS | 72.7 | DSVT |
| Object Detection | nuScenes LiDAR only | NDS (val) | 71.1 | DSVT |
| Object Detection | nuScenes LiDAR only | mAP | 68.4 | DSVT |
| Object Detection | nuScenes LiDAR only | mAP (val) | 66.4 | DSVT |
| Object Detection | nuScenes | NDS | 0.73 | DSVT |
| Object Detection | nuScenes | mAAE | 0.14 | DSVT |
| Object Detection | nuScenes | mAOE | 0.3 | DSVT |
| Object Detection | nuScenes | mASE | 0.23 | DSVT |
| Object Detection | nuScenes | mATE | 0.25 | DSVT |
| Object Detection | nuScenes | mAVE | 0.25 | DSVT |
| Object Detection | Waymo Open Dataset | mAPH/L2 | 72.1 | DSVT |
| Object Detection | Waymo cyclist | APH/L2 | 78 | DSVT (val) |
| Object Detection | Waymo vehicle | APH/L2 | 74.1 | DSVT (val) |
| Object Detection | Waymo vehicle | L1 mAP | 82.1 | DSVT (val) |
| Object Detection | Waymo pedestrian | APH/L2 | 76.4 | DSVT (val) |