Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

Haiyang Wang, Hao Tang, Shaoshuai Shi, Aoxue Li, Zhenguo Li, Bernt Schiele, Liwei Wang

2023-08-15 · ICCV 2023
Tasks: Representation Learning · Autonomous Driving · Object Detection · 3D Object Detection
Links: Paper · PDF · Code (official)

Abstract

Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computation overheads and inefficient collaboration between different sensor data. In this paper, we present an efficient multi-modal backbone for outdoor 3D perception named UniTR, which processes a variety of modalities with unified modeling and shared parameters. Unlike previous works, UniTR introduces a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps. More importantly, to make full use of these complementary sensor types, we present a novel multi-modal integration strategy by both considering semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood relations. UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state-of-the-art performance on the nuScenes benchmark, achieving +1.1 NDS higher for 3D object detection and +12.0 higher mIoU for BEV map segmentation with lower inference latency. Code will be available at https://github.com/Haiyang-W/UniTR .
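The abstract's central idea is that camera and LiDAR features, once embedded to a common token width, can flow through the same shared transformer blocks, so cross-modal interaction happens inside attention itself rather than in a separate fusion step. The following is an illustrative sketch of that idea only, not the authors' implementation; all names, shapes, and the single-head attention are assumptions for clarity.

```python
# Illustrative sketch (NOT the UniTR code): a single shared attention
# block applied to concatenated camera and LiDAR tokens. Because the
# parameters are shared and the token set is mixed, every token attends
# to both modalities with no extra fusion module.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention_block(tokens, w_q, w_k, w_v):
    """Single-head self-attention; one parameter set for all modalities."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + attn @ v  # residual connection

rng = np.random.default_rng(0)
d = 32
cam_tokens = rng.normal(size=(100, d))   # stand-in for image patch embeddings
lidar_tokens = rng.normal(size=(60, d))  # stand-in for sparse voxel embeddings

# Concatenate modalities and run them through one shared block:
tokens = np.concatenate([cam_tokens, lidar_tokens], axis=0)
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = shared_attention_block(tokens, w_q, w_k, w_v)
print(out.shape)  # (160, 32): each output token mixes camera and LiDAR context
```

The real model additionally partitions tokens by 2D-perspective and 3D-sparse-neighborhood relations before attention, which this sketch omits.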

Results

Task                 Dataset   Metric  Value  Model
3D Object Detection  nuScenes  NDS     0.75   UniTR
3D Object Detection  nuScenes  mAP     0.71   UniTR
3D Object Detection  nuScenes  mATE    0.24   UniTR
3D Object Detection  nuScenes  mASE    0.23   UniTR
3D Object Detection  nuScenes  mAOE    0.26   UniTR
3D Object Detection  nuScenes  mAVE    0.24   UniTR
3D Object Detection  nuScenes  mAAE    0.13   UniTR
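The metrics in the results table are internally consistent: the nuScenes Detection Score (NDS) is defined as one tenth of five times mAP plus, for each of the five true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE), the quantity 1 − min(1, error). A quick sanity check against the values above:

```python
# Recompute NDS from the per-metric values in the results table, using
# the nuScenes benchmark definition:
#   NDS = (1/10) * [5 * mAP + sum over TP errors of (1 - min(1, err))]

def nds(map_score, tp_errors):
    """tp_errors: the five TP error metrics mATE, mASE, mAOE, mAVE, mAAE."""
    return 0.1 * (5 * map_score + sum(1 - min(1.0, e) for e in tp_errors))

unitr = {"mAP": 0.71, "mATE": 0.24, "mASE": 0.23,
         "mAOE": 0.26, "mAVE": 0.24, "mAAE": 0.13}

score = nds(unitr["mAP"],
            [unitr[k] for k in ("mATE", "mASE", "mAOE", "mAVE", "mAAE")])
print(f"{score:.3f}")  # 0.745, consistent with the reported NDS of 0.75
```

Note the table values are rounded to two decimals, so the recomputed NDS agrees with the reported 0.75 only up to that rounding.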

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving (2025-07-19)
AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework (2025-07-18)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models (2025-07-17)
Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)