Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, Jifeng Dai

2022-03-31 · Autonomous Driving · Bird's-Eye View Semantic Segmentation · Robust Camera Only 3D Object Detection · 3D Object Detection
Paper · PDF · Code (official)

Abstract

3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from its regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves a new state of the art of 56.9% in terms of the NDS metric on the nuScenes test set, which is 9.0 points higher than previous best arts and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
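The two attention mechanisms the abstract describes can be illustrated with a toy, framework-free sketch: grid-shaped BEV queries attend to multi-camera features (spatial cross-attention), then to the previous frame's BEV (temporal self-attention). NumPy dense attention stands in for the paper's deformable-attention PyTorch implementation; all shapes and variable names here are illustrative, not taken from the repo.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention (the paper uses deformable attention)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

H = W = 4          # BEV grid size (toy; the paper uses e.g. 200x200)
C = 8              # feature dim
n_cams, n_feat = 6, 10
rng = np.random.default_rng(0)

bev_queries = rng.normal(size=(H * W, C))          # one query per BEV cell
cam_feats = rng.normal(size=(n_cams, n_feat, C))   # multi-camera image features

# Spatial cross-attention: each BEV query aggregates camera features.
# (BEVFormer restricts each query to the cameras whose view hits its pillar;
# here every query attends to all features for simplicity.)
flat = cam_feats.reshape(-1, C)
bev = attention(bev_queries, flat, flat)

# Temporal self-attention: recurrently fuse the previous frame's BEV.
prev_bev = rng.normal(size=(H * W, C))
stacked = np.concatenate([bev, prev_bev], axis=0)
bev_t = attention(bev, stacked, stacked)
print(bev_t.shape)  # (16, 8): one fused feature per BEV cell
```

The recurrence is the key design choice: `bev_t` becomes `prev_bev` for the next frame, so history accumulates without stacking multiple past BEV maps.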

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | nuScenes | IoU lane - 224x480 - 100x100 at 0.5 | 25.7 | BEVFormer |
| Semantic Segmentation | nuScenes | IoU veh - 224x480 - No vis filter - 100x100 at 0.5 | 35.8 | BEVFormer |
| Semantic Segmentation | nuScenes | IoU veh - 224x480 - Vis filter - 100x100 at 0.5 | 42.0 | BEVFormer |
| Semantic Segmentation | nuScenes | IoU veh - 448x800 - No vis filter - 100x100 at 0.5 | 39.0 | BEVFormer |
| Semantic Segmentation | nuScenes | IoU veh - 448x800 - Vis filter - 100x100 at 0.5 | 45.5 | BEVFormer |
| Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Long | 44.5 | BEVFormer (EfficientNet-b4) |
| Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Short | 69.9 | BEVFormer (EfficientNet-b4) |
| Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Long | 43.2 | BEVFormer (ResNet-50) |
| Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Short | 68.8 | BEVFormer (ResNet-50) |
| Object Detection | nuScenes Camera Only | NDS | 56.9 | BEVFormer |
| Object Detection | nuScenes | NDS | 0.57 | BEVFormer |
| Object Detection | nuScenes | mAAE | 0.13 | BEVFormer |
| Object Detection | nuScenes | mAOE | 0.38 | BEVFormer |
| Object Detection | nuScenes | mAP | 0.48 | BEVFormer |
| Object Detection | nuScenes | mASE | 0.26 | BEVFormer |
| Object Detection | nuScenes | mATE | 0.58 | BEVFormer |
| Object Detection | nuScenes | mAVE | 0.38 | BEVFormer |
| Object Detection | DAIR-V2X-I | AP\|R40 (easy) | 61.4 | BEVFormer |
| Object Detection | DAIR-V2X-I | AP\|R40 (moderate) | 50.7 | BEVFormer |
| Object Detection | DAIR-V2X-I | AP\|R40 (hard) | 50.7 | BEVFormer |
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| 3D Object Detection | nuScenes Camera Only | NDS | 56.9 | BEVFormer |
| 3D Object Detection | nuScenes | NDS | 0.57 | BEVFormer |
| 3D Object Detection | nuScenes | mAAE | 0.13 | BEVFormer |
| 3D Object Detection | nuScenes | mAOE | 0.38 | BEVFormer |
| 3D Object Detection | nuScenes | mAP | 0.48 | BEVFormer |
| 3D Object Detection | nuScenes | mASE | 0.26 | BEVFormer |
| 3D Object Detection | nuScenes | mATE | 0.58 | BEVFormer |
| 3D Object Detection | nuScenes | mAVE | 0.38 | BEVFormer |
| 3D Object Detection | DAIR-V2X-I | AP\|R40 (easy) | 61.4 | BEVFormer |
| 3D Object Detection | DAIR-V2X-I | AP\|R40 (moderate) | 50.7 | BEVFormer |
| 3D Object Detection | DAIR-V2X-I | AP\|R40 (hard) | 50.7 | BEVFormer |
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Bird's-Eye View Semantic Segmentation | nuScenes | IoU lane - 224x480 - 100x100 at 0.5 | 25.7 | BEVFormer |
| Bird's-Eye View Semantic Segmentation | nuScenes | IoU veh - 224x480 - No vis filter - 100x100 at 0.5 | 35.8 | BEVFormer |
| Bird's-Eye View Semantic Segmentation | nuScenes | IoU veh - 224x480 - Vis filter - 100x100 at 0.5 | 42.0 | BEVFormer |
| Bird's-Eye View Semantic Segmentation | nuScenes | IoU veh - 448x800 - No vis filter - 100x100 at 0.5 | 39.0 | BEVFormer |
| Bird's-Eye View Semantic Segmentation | nuScenes | IoU veh - 448x800 - Vis filter - 100x100 at 0.5 | 45.5 | BEVFormer |
| Bird's-Eye View Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Long | 44.5 | BEVFormer (EfficientNet-b4) |
| Bird's-Eye View Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Short | 69.9 | BEVFormer (EfficientNet-b4) |
| Bird's-Eye View Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Long | 43.2 | BEVFormer (ResNet-50) |
| Bird's-Eye View Semantic Segmentation | Lyft Level 5 | IoU vehicle - 224x480 - Short | 68.8 | BEVFormer (ResNet-50) |
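The BEV segmentation numbers above are intersection-over-union scores computed on a rasterized top-down grid (e.g. the "100x100 at 0.5" setting: a 100 m x 100 m area at 0.5 m per cell). A minimal sketch of the metric itself, on toy 3x3 masks rather than the benchmark's actual rasters:

```python
import numpy as np

# Toy BEV occupancy masks (True = cell predicted / labelled as vehicle)
pred = np.array([[1, 1, 0],
                 [0, 1, 0],
                 [0, 0, 0]], dtype=bool)
gt = np.array([[1, 0, 0],
               [0, 1, 1],
               [0, 0, 0]], dtype=bool)

intersection = np.logical_and(pred, gt).sum()  # cells both agree on: 2
union = np.logical_or(pred, gt).sum()          # cells either marks: 4
iou = intersection / union
print(iou)  # 0.5
```

The "Vis filter" variants above apply this same computation after masking out ground-truth objects below a visibility threshold, which is why their IoU is higher than the unfiltered rows.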

Related Papers

- GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving (2025-07-19)
- AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework (2025-07-18)
- World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
- Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models (2025-07-17)
- Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
- LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
- Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
- Safeguarding Federated Learning-based Road Condition Classification (2025-07-16)