Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, Jifeng Dai
3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design a spatial cross-attention in which each BEV query extracts spatial features from its regions of interest across camera views. For temporal information, we propose a temporal self-attention that recurrently fuses the historical BEV information. Our approach achieves a new state-of-the-art 56.9% NDS on the nuScenes `test` set, which is 9.0 points higher than the previous best methods and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
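
As a rough illustration of how a grid-shaped BEV query could locate its regions of interest across camera views, the sketch below lifts a BEV cell to a few points at different heights and projects them into each camera. It covers only the geometric lookup, not the learned attention, and it is not the released implementation. Everything in it is an assumption made for illustration: the helper names (`bev_grid_centers`, `project`, `hit_views`), the pinhole cameras placed at the ego origin, the six evenly spaced yaw angles, the focal length, the image size, and the sampled heights.

```python
import math

def bev_grid_centers(rows, cols, cell_size):
    """Return (x, y) centres, in metres, of a rows x cols grid centred on the ego vehicle."""
    centers = []
    for r in range(rows):
        for c in range(cols):
            x = (c - cols / 2 + 0.5) * cell_size
            y = (r - rows / 2 + 0.5) * cell_size
            centers.append((x, y))
    return centers

def project(point, yaw, focal, width, height):
    """Project an ego-frame point into a hypothetical pinhole camera.

    The camera sits at the ego origin and its optical axis is rotated by
    `yaw` about the vertical axis. Returns the pixel (u, v), or None when
    the point is behind the camera or falls outside the image.
    """
    x, y, z = point
    depth = x * math.cos(yaw) + y * math.sin(yaw)   # distance along the optical axis
    left = -x * math.sin(yaw) + y * math.cos(yaw)   # offset to the left of the axis
    if depth <= 0.1:
        return None
    u = width / 2 - focal * left / depth
    v = height / 2 - focal * z / depth
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None

def hit_views(center, heights, camera_yaws, focal=300.0, width=800, height=450):
    """Map camera index -> pixels where the cell's pillar points land."""
    hits = {}
    for cam, yaw in enumerate(camera_yaws):
        pixels = []
        for z in heights:
            pixel = project((center[0], center[1], z), yaw, focal, width, height)
            if pixel is not None:
                pixels.append(pixel)
        if pixels:  # keep only the views that actually see this cell
            hits[cam] = pixels
    return hits

# Hypothetical rig: six cameras spaced 60 degrees apart.
CAMERA_YAWS = [math.radians(d) for d in (0, 60, 120, 180, 240, 300)]
HEIGHTS = [-1.0, 0.0, 1.0, 2.0]  # metres relative to the camera, assumed

ahead = hit_views((10.0, 0.0), HEIGHTS, CAMERA_YAWS)
diagonal = hit_views((10.0, 10.0), HEIGHTS, CAMERA_YAWS)
print(sorted(ahead))     # [0]
print(sorted(diagonal))  # [0, 1]
```

In this toy setup a cell straight ahead is seen by one camera, while a cell on the diagonal falls into the overlap of two neighbouring cameras, so a query for that cell would gather features from both views and ignore the other four.
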
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | nuScenes | IoU lane - 224x480 - 100x100 at 0.5 | 25.7 | BEVFormer |
| Semantic Segmentation | nuScenes | IoU veh - 224x480 - No vis filter - 100x100 at 0.5 | 35.8 | BEVFormer |
| Semantic Segmentation | nuScenes | IoU veh - 224x480 - Vis filter. - 100x100 at 0.5 | 42 | BEVFormer |
| Semantic Segmentation | nuScenes | IoU veh - 448x800 - No vis filter - 100x100 at 0.5 | 39 | BEVFormer |
| Semantic Segmentation | nuScenes | IoU veh - 448x800 - Vis filter. - 100x100 at 0.5 | 45.5 | BEVFormer |
| Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Long | 44.5 | BEVFormer (EfficientNet-b4) |
| Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Short | 69.9 | BEVFormer (EfficientNet-b4) |
| Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Long | 43.2 | BEVFormer (ResNet-50) |
| Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Short | 68.8 | BEVFormer (ResNet-50) |
| Object Detection | nuScenes Camera Only | NDS | 56.9 | BEVFormer |
| Object Detection | nuScenes | NDS | 0.57 | BEVFormer |
| Object Detection | nuScenes | mAAE | 0.13 | BEVFormer |
| Object Detection | nuScenes | mAOE | 0.38 | BEVFormer |
| Object Detection | nuScenes | mAP | 0.48 | BEVFormer |
| Object Detection | nuScenes | mASE | 0.26 | BEVFormer |
| Object Detection | nuScenes | mATE | 0.58 | BEVFormer |
| Object Detection | nuScenes | mAVE | 0.38 | BEVFormer |
| Object Detection | DAIR-V2X-I | AP\|R40 (easy) | 61.4 | BEVFormer |
| Object Detection | DAIR-V2X-I | AP\|R40 (hard) | 50.7 | BEVFormer |
| Object Detection | DAIR-V2X-I | AP\|R40 (moderate) | 50.7 | BEVFormer |
| 3D Object Detection | nuScenes Camera Only | NDS | 56.9 | BEVFormer |
| 3D Object Detection | nuScenes | NDS | 0.57 | BEVFormer |
| 3D Object Detection | nuScenes | mAAE | 0.13 | BEVFormer |
| 3D Object Detection | nuScenes | mAOE | 0.38 | BEVFormer |
| 3D Object Detection | nuScenes | mAP | 0.48 | BEVFormer |
| 3D Object Detection | nuScenes | mASE | 0.26 | BEVFormer |
| 3D Object Detection | nuScenes | mATE | 0.58 | BEVFormer |
| 3D Object Detection | nuScenes | mAVE | 0.38 | BEVFormer |
| 3D Object Detection | DAIR-V2X-I | AP\|R40 (easy) | 61.4 | BEVFormer |
| 3D Object Detection | DAIR-V2X-I | AP\|R40 (hard) | 50.7 | BEVFormer |
| 3D Object Detection | DAIR-V2X-I | AP\|R40 (moderate) | 50.7 | BEVFormer |
| Bird's-Eye View Semantic Segmentation | nuScenes | IoU lane - 224x480 - 100x100 at 0.5 | 25.7 | BEVFormer |
| Bird's-Eye View Semantic Segmentation | nuScenes | IoU veh - 224x480 - No vis filter - 100x100 at 0.5 | 35.8 | BEVFormer |
| Bird's-Eye View Semantic Segmentation | nuScenes | IoU veh - 224x480 - Vis filter. - 100x100 at 0.5 | 42 | BEVFormer |
| Bird's-Eye View Semantic Segmentation | nuScenes | IoU veh - 448x800 - No vis filter - 100x100 at 0.5 | 39 | BEVFormer |
| Bird's-Eye View Semantic Segmentation | nuScenes | IoU veh - 448x800 - Vis filter. - 100x100 at 0.5 | 45.5 | BEVFormer |
| Bird's-Eye View Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Long | 44.5 | BEVFormer (EfficientNet-b4) |
| Bird's-Eye View Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Short | 69.9 | BEVFormer (EfficientNet-b4) |
| Bird's-Eye View Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Long | 43.2 | BEVFormer (ResNet-50) |
| Bird's-Eye View Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Short | 68.8 | BEVFormer (ResNet-50) |
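
The nuScenes rows above can be cross-checked against each other, because NDS is defined by the nuScenes detection benchmark as a weighted combination of mAP and the five true-positive error metrics: NDS = (5 · mAP + Σ (1 − min(1, error))) / 10. The short sketch below plugs in the rounded values from the table; the function name `nuscenes_detection_score` is ours, and because the inputs are rounded the result only approximates the reported score.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """Combine mAP and the true-positive error metrics into NDS.

    Each error is clipped to 1 and converted to a score (1 - error);
    mAP carries a weight of 5 and the total is divided by 10.
    """
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0

# Rounded values taken from the nuScenes rows of the table.
errors = {"mATE": 0.58, "mASE": 0.26, "mAOE": 0.38, "mAVE": 0.38, "mAAE": 0.13}
nds = nuscenes_detection_score(0.48, errors)
print(round(nds, 2))  # 0.57
```

The result of about 0.567 agrees with the 0.57 NDS row and, within the rounding of the inputs, with the 56.9 reported for the camera-only test set.
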