BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, Jie Zhou, Jifeng Dai

2022-11-18 · CVPR 2023 · 3D Object Detection


Abstract

We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones. Existing state-of-the-art BEV detectors are often tied to certain depth pre-trained backbones like VoVNet, hindering the synergy between booming image backbones and BEV detectors. To address this limitation, we prioritize easing the optimization of BEV detectors by introducing perspective space supervision. To this end, we propose a two-stage BEV detector, where proposals from the perspective head are fed into the bird's-eye-view head for final predictions. To evaluate the effectiveness of our model, we conduct extensive ablation studies focusing on the form of supervision and the generality of the proposed detector. The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset. The code shall be released soon.
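The two-stage flow described in the abstract can be pictured with a small sketch. The abstract does not say how the bird's-eye-view head consumes the first-stage proposals, so the code below assumes one plausible arrangement: the highest-scoring perspective proposals are projected to the BEV plane and used as reference points alongside a set of learned ones. The names `Proposal` and `build_bev_reference_points`, and the `top_k` parameter, are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) location on the BEV plane


@dataclass
class Proposal:
    """A first-stage detection from the perspective head (hypothetical type)."""
    center: Point  # box centre already projected onto the BEV plane
    score: float   # confidence assigned by the perspective head


def build_bev_reference_points(
    proposals: Sequence[Proposal],
    learned_points: Sequence[Point],
    top_k: int,
) -> List[Point]:
    """Assumed hand-off between the two stages.

    Keeps the top_k proposals by score and appends the learned reference
    points, so the second stage sees both kinds of starting location.
    """
    kept = sorted(proposals, key=lambda p: p.score, reverse=True)[:top_k]
    return [p.center for p in kept] + list(learned_points)


# Illustrative inputs only; these are not values from the paper.
proposals = [
    Proposal(center=(1.0, 2.0), score=0.9),
    Proposal(center=(3.0, 4.0), score=0.2),
    Proposal(center=(5.0, 6.0), score=0.6),
]
reference_points = build_bev_reference_points(proposals, [(0.0, 0.0)], top_k=2)
```

In this sketch the perspective head does two jobs: it is where the perspective supervision would be applied during training, and it supplies starting locations for the second stage. The paper should be consulted for how the proposals are actually encoded and combined.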

Results

Task	Dataset	Metric	Value	Model
3D Object Detection	nuScenes Camera Only	NDS	63.4	BEVFormer v2 (InternImage-XL)
3D Object Detection	Rope3D	AP@0.7	24.64	BEVFormer
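
For readers unfamiliar with the NDS column: the nuScenes detection score combines mean average precision (mAP) with five mean true-positive error terms (mATE, mASE, mAOE, mAVE, mAAE), each clipped to 1 and converted to a score. The sketch below follows that published definition; the function name `nds` and the example inputs are illustrative assumptions, not numbers from this paper. The table reports NDS multiplied by 100.

```python
from typing import Sequence


def nds(m_ap: float, tp_errors: Sequence[float]) -> float:
    """nuScenes detection score on a 0-1 scale.

    m_ap      : mean average precision in [0, 1]
    tp_errors : the five mean true-positive errors, in the order
                mATE, mASE, mAOE, mAVE, mAAE
    """
    if len(tp_errors) != 5:
        raise ValueError("expected five TP errors: mATE, mASE, mAOE, mAVE, mAAE")
    # Each error is clipped to 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    # mAP carries weight 5 and each TP score weight 1, normalised by 10.
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# Hypothetical inputs for illustration only.
example = nds(0.5, [0.5, 0.25, 0.5, 0.25, 0.0])
```

Because mAP carries half of the total weight, two detectors with the same mAP can still differ by up to 0.5 NDS (50 points in the table's scale) depending on their localization, size, orientation, velocity, and attribute errors.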
