
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, Baining Guo

Date: 2023-04-14
Tasks: Scene Understanding, Segmentation, Semantic Segmentation, 3D Object Detection

Abstract

The use of pretrained backbones with fine-tuning has been successful for 2D vision and natural language processing tasks, showing advantages over task-specific networks. In this work, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets, but also outperforms state-of-the-art methods on downstream tasks with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validate the scalability, generality, and superior performance enabled by our approach. The code and models are available at https://github.com/microsoft/Swin3D.
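
The abstract's point about self-attention on sparse voxels can be illustrated with a small sketch. This is not the Swin3D implementation, and the paper's linear-memory claim may rely on implementation details not described here. It only shows the general idea of restricting attention to fixed-size windows over occupied voxels. The function names and the window size are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Voxel = Tuple[int, int, int]


def group_into_windows(voxels: Iterable[Voxel], window: int) -> Dict[Voxel, List[Voxel]]:
    """Bucket occupied voxel coordinates by the cubic window that contains them.

    Only occupied voxels are stored, so empty space costs nothing.
    """
    buckets: Dict[Voxel, List[Voxel]] = defaultdict(list)
    for x, y, z in voxels:
        # Floor division maps a voxel coordinate to its window index.
        buckets[(x // window, y // window, z // window)].append((x, y, z))
    return dict(buckets)


def attention_pairs(buckets: Dict[Voxel, List[Voxel]]) -> int:
    """Count query-key pairs when attention is restricted to each window."""
    return sum(len(members) ** 2 for members in buckets.values())


# Hypothetical toy scene: five occupied voxels, windows of 4 x 4 x 4 voxels.
occupied = [(0, 0, 0), (1, 1, 1), (4, 0, 0), (5, 5, 5), (7, 7, 7)]
windows = group_into_windows(occupied, window=4)
windowed_pairs = attention_pairs(windows)
global_pairs = len(occupied) ** 2
```

Because a window of side w holds at most w^3 voxels, the windowed pair count is at most w^3 times the number of occupied voxels, so for a fixed window size it grows linearly with scene size, whereas global attention grows quadratically.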

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | ScanNet | test mIoU | 77.9 | Swin3D-L |
| Semantic Segmentation | ScanNet | val mIoU | 77.5 | Swin3D-L |
| Semantic Segmentation | S3DIS Area5 | mAcc | 80.5 | Swin3D-L |
| Semantic Segmentation | S3DIS Area5 | mIoU | 74.5 | Swin3D-L |
| Semantic Segmentation | S3DIS Area5 | oAcc | 92.7 | Swin3D-L |
| Semantic Segmentation | S3DIS | Mean IoU | 79.8 | Swin3D-L |
| Semantic Segmentation | S3DIS | mAcc | 88 | Swin3D-L |
| Semantic Segmentation | S3DIS | oAcc | 92.4 | Swin3D-L |
| 3D Object Detection | S3DIS | mAP@0.25 | 72.1 | Swin3D-L+FCAF3D |
| 3D Object Detection | S3DIS | mAP@0.5 | 54 | Swin3D-L+FCAF3D |
| 3D Object Detection | ScanNetV2 | mAP@0.25 | 76.4 | Swin3D-L+CAGroup3D |
| 3D Object Detection | ScanNetV2 | mAP@0.5 | 63.2 | Swin3D-L+CAGroup3D |
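
For readers unfamiliar with the metrics in the results above, the following is a minimal sketch of how mIoU, mAcc, and oAcc are commonly computed from a confusion matrix, and how a 3D box IoU is compared against the 0.25 and 0.5 thresholds behind mAP@0.25 and mAP@0.5. It is not the evaluation code used by the benchmarks or by the paper. The function names are hypothetical, boxes are assumed to be axis-aligned, and classes that never appear in either the labels or the predictions are skipped, which is one of several conventions in use.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float, float, float]  # xmin, ymin, zmin, xmax, ymax, zmax


def confusion_matrix(gt: Sequence[int], pred: Sequence[int], num_classes: int) -> List[List[int]]:
    """cm[g][p] counts points with ground-truth class g predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        cm[g][p] += 1
    return cm


def segmentation_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Return mean IoU, mean class accuracy, and overall accuracy."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(n))
    ious, accs = [], []
    for c in range(n):
        tp = cm[c][c]
        gt_count = sum(cm[c])                         # points labeled c
        pred_count = sum(cm[r][c] for r in range(n))  # points predicted c
        union = gt_count + pred_count - tp
        if union > 0:      # skip classes absent from both labels and predictions
            ious.append(tp / union)
        if gt_count > 0:   # per-class accuracy is only defined for present classes
            accs.append(tp / gt_count)
    return {
        "mIoU": sum(ious) / len(ious) if ious else 0.0,
        "mAcc": sum(accs) / len(accs) if accs else 0.0,
        "oAcc": correct / total if total else 0.0,
    }


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


# Toy data, invented for illustration only.
metrics = segmentation_metrics(confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], 3))
iou = box_iou_3d((0, 0, 0, 2, 2, 2), (1, 0, 0, 3, 2, 2))
matches_at_025 = iou >= 0.25
matches_at_05 = iou >= 0.5
```

In the toy detection case the predicted box would count as a match at the 0.25 threshold but not at 0.5, which is why mAP@0.5 values are lower than mAP@0.25 values for the same model.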

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
- Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
- Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)