Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, Baining Guo

2023-04-14 | Scene Understanding, Segmentation, Semantic Segmentation, 3D Object Detection

Abstract

Pretrained backbones with fine-tuning have been successful in 2D vision and natural language processing, showing advantages over task-specific networks. In this work, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on the synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets, but also outperforms state-of-the-art methods on downstream tasks, with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validates the scalability, generality, and superior performance enabled by our approach. The code and models are available at https://github.com/microsoft/Swin3D.
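As a rough illustration of what window-based attention on sparse voxels involves, the sketch below groups occupied voxel coordinates into cubic windows (optionally shifted, as in Swin-style designs) and counts the query-key pairs that window-local attention would score. The function names `partition_windows` and `attention_pairs`, the window size, and the shift value are assumptions made for this example and are not taken from the Swin3D code. The sketch shows only the grouping step; how the paper reaches linear memory complexity is not described in this abstract, and the sketch does not attempt it.

```python
from collections import defaultdict


def partition_windows(voxels, window_size, shift=0):
    """Group occupied voxel indices by the cubic window they fall into.

    voxels: sequence of (x, y, z) integer coordinates of non-empty voxels.
    window_size: edge length of each window, in voxels (illustrative value).
    shift: offset added to every coordinate before bucketing, which moves
           the window boundaries (the "shifted window" idea).
    """
    windows = defaultdict(list)
    for idx, (x, y, z) in enumerate(voxels):
        key = (
            (x + shift) // window_size,
            (y + shift) // window_size,
            (z + shift) // window_size,
        )
        windows[key].append(idx)
    return dict(windows)


def attention_pairs(windows):
    """Count query-key pairs if attention is restricted to each window."""
    return sum(len(members) ** 2 for members in windows.values())


# Six occupied voxels in an otherwise empty grid (made-up coordinates).
voxels = [(0, 0, 0), (1, 0, 0), (3, 3, 3), (4, 0, 0), (5, 1, 0), (9, 9, 9)]

regular = partition_windows(voxels, window_size=4)
shifted = partition_windows(voxels, window_size=4, shift=2)

print(regular)                   # voxel indices grouped per window
print(attention_pairs(regular))  # pairs scored with window-local attention
print(len(voxels) ** 2)          # pairs scored with global attention
```

Only occupied voxels are stored, so empty space costs nothing, and shifting the window grid between layers changes which voxels share a window (here, voxel 2 leaves the window of voxels 0 and 1).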

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	ScanNet	test mIoU	77.9	Swin3D-L
Semantic Segmentation	ScanNet	val mIoU	77.5	Swin3D-L
Semantic Segmentation	S3DIS Area5	mAcc	80.5	Swin3D-L
Semantic Segmentation	S3DIS Area5	mIoU	74.5	Swin3D-L
Semantic Segmentation	S3DIS Area5	oAcc	92.7	Swin3D-L
Semantic Segmentation	S3DIS	mIoU	79.8	Swin3D-L
Semantic Segmentation	S3DIS	mAcc	88	Swin3D-L
Semantic Segmentation	S3DIS	oAcc	92.4	Swin3D-L
Object Detection	S3DIS	mAP@0.25	72.1	Swin3D-L+FCAF3D
Object Detection	S3DIS	mAP@0.5	54	Swin3D-L+FCAF3D
Object Detection	ScanNetV2	mAP@0.25	76.4	Swin3D-L+CAGroup3D
Object Detection	ScanNetV2	mAP@0.5	63.2	Swin3D-L+CAGroup3D
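
For readers unfamiliar with the segmentation metrics in the table, the following minimal sketch computes oAcc, mAcc, and mIoU from a confusion matrix using their standard definitions. The function name `segmentation_metrics` and the toy matrix are assumptions for illustration; the benchmark evaluation scripts may differ in details such as ignored labels or how absent classes are handled (this sketch simply skips them).

```python
def segmentation_metrics(conf):
    """Compute oAcc, mAcc and mIoU from a square confusion matrix.

    conf[i][j] is the number of points whose true class is i and whose
    predicted class is j. Values are returned as fractions in [0, 1].
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(n))

    ious, accs = [], []
    for i in range(n):
        tp = conf[i][i]
        gt = sum(conf[i])                        # points labelled class i
        pred = sum(conf[j][i] for j in range(n))  # points predicted class i
        union = gt + pred - tp
        if union > 0:   # assumption: skip classes absent from both
            ious.append(tp / union)
        if gt > 0:      # assumption: skip classes with no ground truth
            accs.append(tp / gt)

    return {
        "oAcc": correct / total,       # overall point accuracy
        "mAcc": sum(accs) / len(accs),  # mean per-class accuracy
        "mIoU": sum(ious) / len(ious),  # mean per-class intersection over union
    }


# Toy two-class example (made-up counts, not from the paper).
toy = [[8, 2],
       [1, 9]]
metrics = segmentation_metrics(toy)
print(metrics)
```

mIoU penalizes both missed points and false positives for each class, which is why it is usually lower than mAcc and oAcc on the same predictions.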
