Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, Hengshuang Zhao
As a pioneering work exploring transformer architecture for 3D point cloud understanding, Point Transformer achieves impressive results on multiple highly competitive benchmarks. In this work, we analyze the limitations of Point Transformer and propose Point Transformer V2, a powerful and efficient model with novel designs that overcome those limitations. In particular, we first propose grouped vector attention, which is more effective than the previous version of vector attention. Inheriting the advantages of both learnable weight encoding and multi-head attention, we present a highly effective implementation of grouped vector attention with a novel grouped weight encoding layer. We also strengthen the position information for attention with an additional position encoding multiplier. Furthermore, we design novel and lightweight partition-based pooling methods that enable better spatial alignment and more efficient sampling. Extensive experiments show that our model outperforms its predecessor and achieves state-of-the-art results on several challenging 3D point cloud understanding benchmarks, including 3D point cloud segmentation on ScanNet v2 and S3DIS and 3D point cloud classification on ModelNet40. Our code will be available at https://github.com/Gofinge/PointTransformerV2.
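To make the idea of partition-based pooling concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a uniform grid as the partition, averages coordinates within each cell, and takes a per-channel maximum over features; the names `grid_pool`, `grid_unpool`, and `cell_size` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def grid_pool(points, feats, cell_size):
    """Merge all points that fall into the same grid cell.

    points: list of (x, y, z) tuples.
    feats:  list of feature lists, one per point, all of equal length.
    Returns (pooled_points, pooled_feats, cluster), where cluster[i] is the
    index of the pooled point that input point i was merged into.
    Assumption: mean for coordinates, per-channel max for features.
    """
    cell_to_members = defaultdict(list)
    for i, p in enumerate(points):
        # Integer cell index along each axis defines the partition.
        key = tuple(math.floor(c / cell_size) for c in p)
        cell_to_members[key].append(i)

    pooled_points, pooled_feats = [], []
    cluster = [0] * len(points)
    num_channels = len(feats[0])
    # Sorting the cell keys makes the output order deterministic.
    for new_idx, key in enumerate(sorted(cell_to_members)):
        members = cell_to_members[key]
        n = len(members)
        pooled_points.append(
            tuple(sum(points[i][d] for i in members) / n for d in range(3))
        )
        pooled_feats.append(
            [max(feats[i][c] for i in members) for c in range(num_channels)]
        )
        for i in members:
            cluster[i] = new_idx
    return pooled_points, pooled_feats, cluster


def grid_unpool(pooled_feats, cluster):
    """Copy each pooled feature back to every point of its cell."""
    return [list(pooled_feats[c]) for c in cluster]


# Toy example: the first two points share a cell, the third is alone.
points = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (1.2, 0.0, 0.0)]
feats = [[1.0, 5.0], [2.0, 3.0], [7.0, 0.0]]
pooled_points, pooled_feats, cluster = grid_pool(points, feats, cell_size=1.0)
restored = grid_unpool(pooled_feats, cluster)
```

Because the `cluster` mapping is kept, unpooling needs no neighbor search: each point simply reads back the feature of the cell it was assigned to, which is one way to see why a partition can give cheap sampling and consistent spatial alignment between pooling and unpooling.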
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | ScanNet | test mIoU | 75.2 | PTv2 |
| Semantic Segmentation | ScanNet | val mIoU | 75.4 | PTv2 |
| Semantic Segmentation | S3DIS Area5 | mAcc | 78 | PTv2 |
| Semantic Segmentation | S3DIS Area5 | mIoU | 72.6 | PTv2 |
| Semantic Segmentation | S3DIS Area5 | oAcc | 91.6 | PTv2 |
| Semantic Segmentation | ScanNet++ | Top-1 IoU | 0.445 | PTv2 |
| Semantic Segmentation | ScanNet++ | Top-3 IoU | 0.688 | PTv2 |
| Semantic Segmentation | S3DIS | mIoU (Area-5) | 71.6 | PointTransformerV2 |
| 3D Point Cloud Classification | ModelNet40 | Mean Accuracy | 91.6 | PTv2 |
| 3D Point Cloud Classification | ModelNet40 | Overall Accuracy | 94.2 | PTv2 |
| LIDAR Semantic Segmentation | nuScenes | test mIoU | 0.826 | PTv2 |
| LIDAR Semantic Segmentation | nuScenes | val mIoU | 0.802 | PTv2 |
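For readers unfamiliar with the segmentation metrics reported here, the sketch below shows how mIoU, mAcc, and oAcc are commonly derived from a confusion matrix. It is an illustrative assumption rather than any benchmark's official evaluation script: the function name `segmentation_metrics` is hypothetical, and conventions for classes absent from the ground truth vary between benchmarks (here such classes are simply skipped).

```python
def segmentation_metrics(conf):
    """Compute mIoU, mAcc and oAcc from a square confusion matrix.

    conf[t][p] is the number of points whose ground-truth class is t
    and whose predicted class is p.
    Assumption: at least one class appears in the ground truth.
    """
    k = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[c][c] for c in range(k))

    ious, accs = [], []
    for c in range(k):
        tp = conf[c][c]
        gt = sum(conf[c])                         # points labeled c
        pred = sum(conf[r][c] for r in range(k))  # points predicted as c
        union = gt + pred - tp
        if union > 0:
            ious.append(tp / union)  # per-class intersection over union
        if gt > 0:
            accs.append(tp / gt)     # per-class accuracy (recall)

    return {
        "mIoU": sum(ious) / len(ious),  # mean of per-class IoU
        "mAcc": sum(accs) / len(accs),  # mean of per-class accuracy
        "oAcc": correct / total,        # overall point accuracy
    }


# Toy two-class example.
metrics = segmentation_metrics([[8, 2], [1, 9]])
```

Note that mIoU and mAcc weight every class equally, while oAcc is dominated by frequent classes, which is why the three numbers for the same model can differ substantially.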