Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, Peng Gao

Published: 2022-11-21 · ICCV 2023

Tasks: Zero-shot 3D Classification, Zero-shot 3D Point Cloud Classification, Training-free 3D Part Segmentation, 3D Classification, Descriptive, Zero-Shot Transfer 3D Point Cloud Classification, 3D Open-Vocabulary Instance Segmentation, Open Vocabulary Object Detection, Classification, 3D Part Segmentation, Object Detection, 3D Object Detection, Training-free 3D Point Cloud Classification

Abstract

Large-scale pre-trained models have shown promising open-world performance for both vision and language tasks. However, their transferred capacity on 3D point clouds is still limited and constrained to the classification task. In this paper, we first combine CLIP and GPT into a unified 3D open-world learner, named PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection. To better align 3D data with the pre-trained language knowledge, PointCLIP V2 contains two key designs. On the visual end, we prompt CLIP via a shape projection module to generate more realistic depth maps, narrowing the domain gap between projected point clouds and natural images. On the textual end, we prompt the GPT model to generate 3D-specific text as the input of CLIP's textual encoder. Without any training in 3D domains, our approach significantly surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification. On top of that, V2 can be extended to few-shot 3D classification, zero-shot 3D part segmentation, and 3D object detection in a simple manner, demonstrating its generalization ability for unified 3D open-world learning.
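The pipeline the abstract describes — project the point cloud to a depth map, embed it with CLIP's visual encoder, and score it against text embeddings of class prompts — can be illustrated with a toy sketch. This is not the paper's method: the naive orthographic projection below stands in for PointCLIP V2's learned shape projection module, and random vectors stand in for CLIP's visual/text encoders and the GPT-generated prompts; all function names here are illustrative.

```python
import numpy as np

def project_to_depth_map(points, resolution=32):
    """Orthographically project a point cloud (N, 3) onto the XY plane,
    keeping the nearest (smallest) normalized z per pixel as depth.
    A toy stand-in for PointCLIP V2's shape projection module."""
    depth = np.full((resolution, resolution), np.inf)
    mins, maxs = points.min(axis=0), points.max(axis=0)
    norm = (points - mins) / (maxs - mins + 1e-8)  # scale into [0, 1)
    xs = np.clip((norm[:, 0] * resolution).astype(int), 0, resolution - 1)
    ys = np.clip((norm[:, 1] * resolution).astype(int), 0, resolution - 1)
    for x, y, z in zip(xs, ys, norm[:, 2]):
        depth[y, x] = min(depth[y, x], z)
    depth[np.isinf(depth)] = 0.0  # empty pixels become background
    return depth

def cosine_scores(image_feat, text_feats):
    """Zero-shot classification as in CLIP: cosine similarity between
    one image embedding and per-class text embeddings."""
    a = image_feat / (np.linalg.norm(image_feat) + 1e-8)
    b = text_feats / (np.linalg.norm(text_feats, axis=1, keepdims=True) + 1e-8)
    return b @ a

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(1024, 3))      # synthetic point cloud
depth = project_to_depth_map(cloud)                 # "image" for CLIP
image_feat = depth.flatten()[:64]                   # placeholder visual embedding
text_feats = rng.normal(size=(3, 64))               # placeholder class-prompt embeddings
scores = cosine_scores(image_feat, text_feats)
pred = int(np.argmax(scores))                       # index of the best-matching class
```

In the real system, the depth map would pass through CLIP's visual encoder and the per-class text features would come from GPT-generated 3D-specific descriptions fed to CLIP's text encoder; the argmax over cosine similarities is the zero-shot prediction in both cases.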

Results

Task | Dataset | Metric | Value | Model
Shape Representation of 3D Point Clouds | ScanObjectNN | OBJ_BG Accuracy (%) | 41.22 | PointCLIP V2
Shape Representation of 3D Point Clouds | ScanObjectNN | OBJ_ONLY Accuracy (%) | 50.09 | PointCLIP V2
Shape Representation of 3D Point Clouds | ScanObjectNN | PB_T50_RS Accuracy (%) | 35.36 | PointCLIP V2
Shape Representation of 3D Point Clouds | ModelNet40 | Accuracy (%) | 64.22 | PointCLIP V2
Shape Representation of 3D Point Clouds | ModelNet10 | Accuracy (%) | 73.13 | PointCLIP V2
3D Point Cloud Classification | ScanObjectNN | OBJ_BG Accuracy (%) | 41.22 | PointCLIP V2
3D Point Cloud Classification | ScanObjectNN | OBJ_ONLY Accuracy (%) | 50.09 | PointCLIP V2
3D Point Cloud Classification | ScanObjectNN | PB_T50_RS Accuracy (%) | 35.36 | PointCLIP V2
3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 64.22 | PointCLIP V2
3D Point Cloud Classification | ModelNet10 | Accuracy (%) | 73.13 | PointCLIP V2
Training-free 3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 64.2 | PointCLIP V2
Training-free 3D Point Cloud Classification | ScanObjectNN | Accuracy (%) | 35.4 | PointCLIP V2
Training-free 3D Part Segmentation | ShapeNet-Part | mIoU | 48.4 | PointCLIP V2
3D Open-Vocabulary Instance Segmentation | STPLS3D | AP50 | 3.1 | PointCLIP V2
3D Point Cloud Reconstruction | ScanObjectNN | OBJ_BG Accuracy (%) | 41.22 | PointCLIP V2
3D Point Cloud Reconstruction | ScanObjectNN | OBJ_ONLY Accuracy (%) | 50.09 | PointCLIP V2
3D Point Cloud Reconstruction | ScanObjectNN | PB_T50_RS Accuracy (%) | 35.36 | PointCLIP V2
3D Point Cloud Reconstruction | ModelNet40 | Accuracy (%) | 64.22 | PointCLIP V2
3D Point Cloud Reconstruction | ModelNet10 | Accuracy (%) | 73.13 | PointCLIP V2

Related Papers

DiffRhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization (2025-07-17)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)