Voxel Transformer for 3D Object Detection

Jiageng Mao, Yujing Xue, Minzhe Niu, Haoyue Bai, Jiashi Feng, Xiaodan Liang, Hang Xu, Chunjing Xu

2021-09-06 | ICCV 2021 | Tasks: Object Recognition, 3D Object Detection, Object Detection

Abstract

We present Voxel Transformer (VoTr), a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds. Conventional 3D convolutional backbones in voxel-based 3D detectors cannot efficiently capture large context information, which is crucial for object recognition and localization, owing to their limited receptive fields. In this paper, we resolve the problem by introducing a Transformer-based architecture that models long-range relationships between voxels through self-attention. Because non-empty voxels are naturally sparse but numerous, directly applying a standard Transformer to voxels is non-trivial. To this end, we propose the sparse voxel module and the submanifold voxel module, which can operate effectively on the empty and non-empty voxel positions. To further enlarge the attention range while keeping the computational overhead comparable to that of the convolutional counterparts, we propose two attention mechanisms for multi-head attention in those two modules: Local Attention and Dilated Attention. We further propose Fast Voxel Query to accelerate the querying process in multi-head attention. VoTr contains a series of sparse and submanifold voxel modules and can be applied in most voxel-based detectors. VoTr shows consistent improvement over the convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Waymo Open dataset.
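
The idea of attending only to non-empty voxels, over either a dense local window or a sparser, wider set of positions, can be illustrated with a small sketch. This is not the paper's implementation. The function names, the cubic window shapes, the `radius` and `stride` parameters, and the use of a Python dictionary for voxel lookup are all assumptions made for illustration. Learned query/key/value projections and multiple heads are omitted.

```python
import math
from itertools import product


def local_offsets(radius):
    """Dense cube of integer offsets (hypothetical stand-in for a local window)."""
    r = range(-radius, radius + 1)
    return list(product(r, r, r))


def dilated_offsets(radius, stride):
    """Offsets sampled every `stride` steps: wider range, far fewer positions.

    Assumes radius is a multiple of stride so that (0, 0, 0) is included.
    """
    r = range(-radius, radius + 1, stride)
    return list(product(r, r, r))


def sparse_attention(features, offsets):
    """Single-head dot-product attention over non-empty voxels only.

    features: dict mapping (x, y, z) -> list of floats, one entry per
    non-empty voxel. Empty voxels are simply absent from the dict.
    """
    out = {}
    for (x, y, z), query in features.items():
        # Voxel query: keep only the offsets that land on a non-empty voxel.
        # The dict lookup is an illustrative substitute, not Fast Voxel Query.
        keys = []
        for dx, dy, dz in offsets:
            pos = (x + dx, y + dy, z + dz)
            if pos in features:
                keys.append(features[pos])
        scale = math.sqrt(len(query))
        scores = [sum(a * b for a, b in zip(query, k)) / scale for k in keys]
        # Numerically stable softmax over the attended voxels.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out[(x, y, z)] = [
            sum(w * k[i] for w, k in zip(weights, keys)) / total
            for i in range(len(query))
        ]
    return out


# Three non-empty voxels in an otherwise empty grid (made-up toy data).
voxels = {
    (0, 0, 0): [1.0, 0.0],
    (1, 0, 0): [0.0, 1.0],
    (4, 0, 0): [1.0, 1.0],
}
local_out = sparse_attention(voxels, local_offsets(1))
dilated_out = sparse_attention(voxels, dilated_offsets(4, 2))
```

In this toy setup the voxel at (4, 0, 0) has no non-empty neighbour inside the radius-1 window, so local attention returns its own feature unchanged, while the dilated offsets reach the voxel at (0, 0, 0) four steps away. Under these assumptions the two offset sets contain 27 and 125 positions respectively, whereas a dense window of radius 4 would contain 729.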

Results

Task	Dataset	Metric	Value	Model
3D Object Detection	waymo vehicle	L1 mAP	74.95	VoTr-TSD
