Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

Haiyang Wang, Hao Tang, Shaoshuai Shi, Aoxue Li, Zhenguo Li, Bernt Schiele, Liwei Wang

2023-08-15 · ICCV 2023
Tasks: Representation Learning · Autonomous Driving · Object Detection · 3D Object Detection
Links: Paper · PDF · Code (official)

Abstract

Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computation overheads and inefficient collaboration between different sensor data. In this paper, we present an efficient multi-modal backbone for outdoor 3D perception named UniTR, which processes a variety of modalities with unified modeling and shared parameters. Unlike previous works, UniTR introduces a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps. More importantly, to make full use of these complementary sensor types, we present a novel multi-modal integration strategy by both considering semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood relations. UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state-of-the-art performance on the nuScenes benchmark, achieving +1.1 NDS higher for 3D object detection and +12.0 higher mIoU for BEV map segmentation with lower inference latency. Code will be available at https://github.com/Haiyang-W/UniTR .
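The abstract's central idea is that camera and LiDAR features, once embedded to a common token width, can flow through the same shared transformer blocks, so cross-modal interaction happens inside attention itself rather than in a separate fusion step. The following is an illustrative sketch of that idea only, not the authors' implementation; all names, shapes, and the single-head attention are assumptions for clarity.

```python
# Illustrative sketch (NOT the UniTR code): a single shared attention
# block applied to concatenated camera and LiDAR tokens. Because the
# parameters are shared and the token set is mixed, every token attends
# to both modalities with no extra fusion module.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention_block(tokens, w_q, w_k, w_v):
    """Single-head self-attention; one parameter set for all modalities."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + attn @ v  # residual connection

rng = np.random.default_rng(0)
d = 32
cam_tokens = rng.normal(size=(100, d))   # stand-in for image patch embeddings
lidar_tokens = rng.normal(size=(60, d))  # stand-in for sparse voxel embeddings

# Concatenate modalities and run them through one shared block:
tokens = np.concatenate([cam_tokens, lidar_tokens], axis=0)
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = shared_attention_block(tokens, w_q, w_k, w_v)
print(out.shape)  # (160, 32): each output token mixes camera and LiDAR context
```

The real model additionally partitions tokens by 2D-perspective and 3D-sparse-neighborhood relations before attention, which this sketch omits.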

Results

Task                 Dataset   Metric  Value  Model
3D Object Detection  nuScenes  NDS     0.75   UniTR
3D Object Detection  nuScenes  mAP     0.71   UniTR
3D Object Detection  nuScenes  mATE    0.24   UniTR
3D Object Detection  nuScenes  mASE    0.23   UniTR
3D Object Detection  nuScenes  mAOE    0.26   UniTR
3D Object Detection  nuScenes  mAVE    0.24   UniTR
3D Object Detection  nuScenes  mAAE    0.13   UniTR
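The metrics in the results table are internally consistent: the nuScenes Detection Score (NDS) is defined as one tenth of five times mAP plus, for each of the five true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE), the quantity 1 − min(1, error). A quick sanity check against the values above:

```python
# Recompute NDS from the per-metric values in the results table, using
# the nuScenes benchmark definition:
#   NDS = (1/10) * [5 * mAP + sum over TP errors of (1 - min(1, err))]

def nds(map_score, tp_errors):
    """tp_errors: the five TP error metrics mATE, mASE, mAOE, mAVE, mAAE."""
    return 0.1 * (5 * map_score + sum(1 - min(1.0, e) for e in tp_errors))

unitr = {"mAP": 0.71, "mATE": 0.24, "mASE": 0.23,
         "mAOE": 0.26, "mAVE": 0.24, "mAAE": 0.13}

score = nds(unitr["mAP"],
            [unitr[k] for k in ("mATE", "mASE", "mAOE", "mAVE", "mAAE")])
print(f"{score:.3f}")  # 0.745, consistent with the reported NDS of 0.75
```

Note the table values are rounded to two decimals, so the recomputed NDS agrees with the reported 0.75 only up to that rounding.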

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving (2025-07-19)
AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework (2025-07-18)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models (2025-07-17)
Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)