Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, Vladlen Koltun
Self-attention networks have revolutionized natural language processing and are making impressive strides in image analysis tasks such as image classification and object detection. Inspired by this success, we investigate the application of self-attention networks to 3D point cloud processing. We design self-attention layers for point clouds and use these to construct self-attention networks for tasks such as semantic scene segmentation, object part segmentation, and object classification. Our Point Transformer design improves upon prior work across domains and tasks. For example, on the challenging S3DIS dataset for large-scale semantic scene segmentation, the Point Transformer attains an mIoU of 70.4% on Area 5, outperforming the strongest prior model by 3.3 absolute percentage points and crossing the 70% mIoU threshold for the first time.
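
To make the idea of a self-attention layer on a point cloud concrete, here is a minimal sketch of scalar dot-product attention restricted to each point's k nearest neighbors. It is not the Point Transformer layer itself, whose exact form the abstract does not specify. The function names (`knn`, `local_attention`), the use of identity query/key/value projections, and the brute-force neighbor search are all assumptions made for illustration.

```python
import math


def knn(points, i, k):
    """Indices of the k points closest to points[i] (the point itself included)."""
    order = sorted(range(len(points)), key=lambda j: math.dist(points[i], points[j]))
    return order[:k]


def local_attention(points, feats, k=3):
    """Scalar dot-product self-attention over each point's k nearest neighbors.

    Assumption: queries, keys, and values are the raw features (identity
    projections); a trained layer would apply learned linear maps instead.
    """
    dim = len(feats[0])
    out = []
    for i in range(len(points)):
        nbrs = knn(points, i, k)
        # Similarity between the query point and each neighbor, scaled by sqrt(dim).
        scores = [
            sum(a * b for a, b in zip(feats[i], feats[j])) / math.sqrt(dim)
            for j in nbrs
        ]
        # Numerically stable softmax over the neighborhood.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output feature: attention-weighted sum of neighbor features.
        out.append(
            [sum(w * feats[j][c] for w, j in zip(weights, nbrs)) for c in range(dim)]
        )
    return out
```

Because each output depends only on the set of neighbors and their features, reordering the input points reorders the outputs in the same way. That is the property that makes attention a natural fit for unordered point sets.
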
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | S3DIS Area5 | mAcc | 76.5 | PointTransformer |
| Semantic Segmentation | S3DIS Area5 | mIoU | 70.4 | PointTransformer |
| Semantic Segmentation | S3DIS Area5 | oAcc | 90.8 | PointTransformer |
| Semantic Segmentation | S3DIS Area5 | mIoU | 57.3 | PointCNN |
| Semantic Segmentation | S3DIS Area5 | mIoU | 41.1 | PointNet |
| Semantic Segmentation | S3DIS | Mean IoU | 73.5 | PointTransformer |
| Semantic Segmentation | S3DIS | Params (M) | 7.8 | PointTransformer |
| Semantic Segmentation | S3DIS | mAcc | 81.9 | PointTransformer |
| Semantic Segmentation | S3DIS | oAcc | 90.2 | PointTransformer |
| Semantic Segmentation | S3DIS | Mean IoU | 70.6 | KPConv |
| Semantic Segmentation | S3DIS | Params (M) | 14.1 | KPConv |
| Semantic Segmentation | S3DIS | Mean IoU | 65.4 | PointCNN |
| Semantic Segmentation | S3DIS | Mean IoU | 62.1 | SPGraph |
| Semantic Segmentation | S3DIS | Mean IoU | 47.6 | PointNet |
| Semantic Segmentation | STPLS3D | mIOU | 47.64 | Point transformer |
| Semantic Segmentation | S3DIS | mIoU (6-Fold) | 73.5 | PointTransformer |
| Semantic Segmentation | S3DIS | mIoU (Area-5) | 70.4 | PointTransformer |
| Semantic Segmentation | ShapeNet-Part | Class Average IoU | 83.7 | PointTransformer |
| Semantic Segmentation | ShapeNet-Part | Instance Average IoU | 86.6 | PointTransformer |
| Shape Representation Of 3D Point Clouds | ModelNet40 | Mean Accuracy | 90.6 | PointTransformer |
| Shape Representation Of 3D Point Clouds | ModelNet40 | Overall Accuracy | 93.7 | PointTransformer |
| 3D Semantic Segmentation | STPLS3D | mIOU | 47.64 | Point transformer |
| 3D Semantic Segmentation | S3DIS | mIoU (6-Fold) | 73.5 | PointTransformer |
| 3D Semantic Segmentation | S3DIS | mIoU (Area-5) | 70.4 | PointTransformer |
| 3D Point Cloud Classification | ModelNet40 | Mean Accuracy | 90.6 | PointTransformer |
| 3D Point Cloud Classification | ModelNet40 | Overall Accuracy | 93.7 | PointTransformer |
| Point Cloud Segmentation | PointCloud-C | mean Corruption Error (mCE) | 1.049 | PointTransformers |
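
The table reports mIoU, mAcc, and oAcc without defining them. The sketch below shows how these three metrics are commonly computed from a per-class confusion matrix. The function name `segmentation_metrics` and the choice to skip classes that never occur are assumptions; the evaluation protocol behind the numbers above may differ in such details.

```python
def segmentation_metrics(confusion):
    """Compute oAcc, mAcc, and mIoU from a square confusion matrix.

    confusion[t][p] is the number of points whose true class is t and whose
    predicted class is p.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))

    ious, accs = [], []
    for c in range(n):
        tp = confusion[c][c]
        gt = sum(confusion[c])                         # points truly in class c
        pred = sum(confusion[t][c] for t in range(n))  # points predicted as class c
        union = gt + pred - tp
        # Assumption: classes absent from both labels and predictions are skipped.
        if union > 0:
            ious.append(tp / union)
        if gt > 0:
            accs.append(tp / gt)

    return {
        "oAcc": correct / total,        # overall accuracy over all points
        "mAcc": sum(accs) / len(accs),  # mean of per-class accuracies
        "mIoU": sum(ious) / len(ious),  # mean of per-class intersection-over-union
    }


metrics = segmentation_metrics([[8, 2], [1, 9]])
```

For the two-class example, the per-class IoUs are 8/11 and 9/12, so mIoU is roughly 0.739, while oAcc and mAcc are both about 0.85. mIoU penalizes false positives as well as false negatives, which is why it is typically lower than either accuracy figure.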