BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, Jifeng Dai

2022-03-31

Autonomous Driving, Bird's-Eye View Semantic Segmentation, Robust Camera Only 3D Object Detection, 3D Object Detection

Abstract

3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design a spatial cross-attention in which each BEV query extracts spatial features from its regions of interest across camera views. For temporal information, we propose a temporal self-attention that recurrently fuses the historical BEV information. Our approach achieves a new state-of-the-art 56.9% NDS on the nuScenes test set, which is 9.0 points higher than the previous best methods and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
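
As a rough illustration of how grid-shaped BEV queries can be tied to "regions of interest across camera views", the sketch below lays out a BEV grid around the ego vehicle, lifts each cell to a few heights, and projects those points into a ring of cameras to find which views can see the cell. Everything here is an assumption made for illustration: the idealized pinhole cameras placed at the ego origin, the chosen heights, and all function and parameter names are hypothetical and are not taken from the BEVFormer code. The attention itself (sampling and weighting image features around the projected points) is left out.

```python
import math

def bev_grid_centers(h, w, cell_size):
    """Ego-frame (x, y) centers of an h x w BEV grid centered on the ego vehicle."""
    centers = []
    for i in range(h):
        for j in range(w):
            x = (j - w / 2 + 0.5) * cell_size
            y = (i - h / 2 + 0.5) * cell_size
            centers.append((x, y))
    return centers

def make_camera_ring(n=6, f=400.0, width=800, img_height=450, height=1.5):
    """Hypothetical rig: n pinhole cameras at the ego origin, evenly spaced in yaw."""
    return [
        {"yaw": 2 * math.pi * k / n, "f": f, "cx": width / 2, "cy": img_height / 2,
         "width": width, "img_height": img_height, "height": height}
        for k in range(n)
    ]

def project(point, cam):
    """Project an ego-frame point (x, y, z) to pixel (u, v); None if not visible."""
    x, y, z = point
    c, s = math.cos(cam["yaw"]), math.sin(cam["yaw"])
    depth = c * x + s * y        # distance along the camera's optical axis
    lateral = -s * x + c * y     # offset to the left of the optical axis
    if depth <= 0.1:             # behind (or too close to) the camera
        return None
    u = cam["cx"] - cam["f"] * lateral / depth
    v = cam["cy"] - cam["f"] * (z - cam["height"]) / depth
    if 0 <= u < cam["width"] and 0 <= v < cam["img_height"]:
        return (u, v)
    return None

def hit_views(center, cams, heights=(0.0, 1.5, 3.0)):
    """Map camera index -> projected points for the cameras that see this BEV cell."""
    hits = {}
    for idx, cam in enumerate(cams):
        pts = []
        for z in heights:
            p = project((center[0], center[1], z), cam)
            if p is not None:
                pts.append(p)
        if pts:
            hits[idx] = pts
    return hits

cams = make_camera_ring()
front_only = hit_views((10.0, 0.0), cams)   # cell straight ahead of the ego vehicle
overlap = hit_views((8.66, 5.0), cams)      # cell between two neighbouring cameras
```

In this toy rig a cell straight ahead is seen only by the front camera, while a cell 30 degrees off-axis falls in the overlap of two neighbouring cameras. A query would then only need to attend to image features near the projected points in those views rather than to every pixel of every camera.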

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	nuScenes	IoU lane - 224x480 - 100x100 at 0.5	25.7	BEVFormer
Semantic Segmentation	nuScenes	IoU veh - 224x480 - No vis filter - 100x100 at 0.5	35.8	BEVFormer
Semantic Segmentation	nuScenes	IoU veh - 224x480 - Vis filter. - 100x100 at 0.5	42	BEVFormer
Semantic Segmentation	nuScenes	IoU veh - 448x800 - No vis filter - 100x100 at 0.5	39	BEVFormer
Semantic Segmentation	nuScenes	IoU veh - 448x800 - Vis filter. - 100x100 at 0.5	45.5	BEVFormer
Semantic Segmentation	Lyft Level 5	IoU vehicle - 224x480 - Long	44.5	BEVFormer (EfficientNet-b4)
Semantic Segmentation	Lyft Level 5	IoU vehicle - 224x480 - Short	69.9	BEVFormer (EfficientNet-b4)
Semantic Segmentation	Lyft Level 5	IoU vehicle - 224x480 - Long	43.2	BEVFormer (ResNet-50)
Semantic Segmentation	Lyft Level 5	IoU vehicle - 224x480 - Short	68.8	BEVFormer (ResNet-50)
Object Detection	nuScenes Camera Only	NDS	56.9	BEVFormer
Object Detection	nuScenes	NDS	0.57	BEVFormer
Object Detection	nuScenes	mAAE	0.13	BEVFormer
Object Detection	nuScenes	mAOE	0.38	BEVFormer
Object Detection	nuScenes	mAP	0.48	BEVFormer
Object Detection	nuScenes	mASE	0.26	BEVFormer
Object Detection	nuScenes	mATE	0.58	BEVFormer
Object Detection	nuScenes	mAVE	0.38	BEVFormer
Object Detection	DAIR-V2X-I	AP|R40(easy)	61.4	BEVFormer
Object Detection	DAIR-V2X-I	AP|R40(hard)	50.7	BEVFormer
Object Detection	DAIR-V2X-I	AP|R40(moderate)	50.7	BEVFormer
3D Object Detection	nuScenes Camera Only	NDS	56.9	BEVFormer
3D Object Detection	nuScenes	NDS	0.57	BEVFormer
3D Object Detection	nuScenes	mAAE	0.13	BEVFormer
3D Object Detection	nuScenes	mAOE	0.38	BEVFormer
3D Object Detection	nuScenes	mAP	0.48	BEVFormer
3D Object Detection	nuScenes	mASE	0.26	BEVFormer
3D Object Detection	nuScenes	mATE	0.58	BEVFormer
3D Object Detection	nuScenes	mAVE	0.38	BEVFormer
3D Object Detection	DAIR-V2X-I	AP|R40(easy)	61.4	BEVFormer
3D Object Detection	DAIR-V2X-I	AP|R40(hard)	50.7	BEVFormer
3D Object Detection	DAIR-V2X-I	AP|R40(moderate)	50.7	BEVFormer
Bird's-Eye View Semantic Segmentation	nuScenes	IoU lane - 224x480 - 100x100 at 0.5	25.7	BEVFormer
Bird's-Eye View Semantic Segmentation	nuScenes	IoU veh - 224x480 - No vis filter - 100x100 at 0.5	35.8	BEVFormer
Bird's-Eye View Semantic Segmentation	nuScenes	IoU veh - 224x480 - Vis filter. - 100x100 at 0.5	42	BEVFormer
Bird's-Eye View Semantic Segmentation	nuScenes	IoU veh - 448x800 - No vis filter - 100x100 at 0.5	39	BEVFormer
Bird's-Eye View Semantic Segmentation	nuScenes	IoU veh - 448x800 - Vis filter. - 100x100 at 0.5	45.5	BEVFormer
Bird's-Eye View Semantic Segmentation	Lyft Level 5	IoU vehicle - 224x480 - Long	44.5	BEVFormer (EfficientNet-b4)
Bird's-Eye View Semantic Segmentation	Lyft Level 5	IoU vehicle - 224x480 - Short	69.9	BEVFormer (EfficientNet-b4)
Bird's-Eye View Semantic Segmentation	Lyft Level 5	IoU vehicle - 224x480 - Long	43.2	BEVFormer (ResNet-50)
Bird's-Eye View Semantic Segmentation	Lyft Level 5	IoU vehicle - 224x480 - Short	68.8	BEVFormer (ResNet-50)
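
The nuScenes detection rows above can be cross-checked against each other. The nuScenes benchmark defines NDS as a weighted combination of mAP and the five true-positive error metrics: NDS = (5 * mAP + sum over the errors of (1 - min(1, error))) / 10. The short sketch below applies that definition to the rounded values listed in the table. The function name `nds` is chosen here for illustration, and because the inputs are rounded to two decimals the result only approximates the reported score.

```python
def nds(mean_ap, tp_errors):
    """nuScenes Detection Score from mAP and the true-positive error metrics."""
    # Each error is clipped to 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0

# Rounded values copied from the nuScenes rows of the table.
reported_errors = {
    "mATE": 0.58,  # translation error
    "mASE": 0.26,  # scale error
    "mAOE": 0.38,  # orientation error
    "mAVE": 0.38,  # velocity error
    "mAAE": 0.13,  # attribute error
}
score = nds(0.48, reported_errors)
print(f"NDS from rounded table values: {score:.3f}")
```

The result rounds to 0.57, which agrees with the NDS row of the table; the small gap to the 56.9 figure is what one would expect from working with two-decimal inputs.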
