
PointCLIP: Point Cloud Understanding by CLIP

Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, Hongsheng Li

2021-12-04 · CVPR 2022

Tasks: Zero-shot 3D Classification, Few-Shot Learning, Zero-shot 3D Point Cloud Classification, Training-free 3D Part Segmentation, Transfer Learning, Zero-Shot Transfer 3D Point Cloud Classification, 3D Open-Vocabulary Instance Segmentation, Open Vocabulary Object Detection, Training-free 3D Point Cloud Classification

Abstract

Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP), which learns to match images with their corresponding texts in open-vocabulary settings, have shown inspiring performance on 2D visual recognition. However, it remains underexplored whether CLIP, pre-trained on large-scale 2D image-text pairs, can be generalized to 3D recognition. In this paper, we show that such a setting is feasible by proposing PointCLIP, which aligns CLIP-encoded point clouds with 3D category texts. Specifically, we encode a point cloud by projecting it into multi-view depth maps without rendering, and aggregate the view-wise zero-shot predictions to achieve knowledge transfer from 2D to 3D. On top of that, we design an inter-view adapter to better extract the global feature and adaptively fuse the few-shot knowledge learned from 3D into CLIP pre-trained in 2D. By fine-tuning just this lightweight adapter in few-shot settings, the performance of PointCLIP can be largely improved. In addition, we observe a complementary property between PointCLIP and classical 3D-supervised networks: by simple ensembling, PointCLIP boosts the baseline's performance and even surpasses state-of-the-art models. PointCLIP is therefore a promising alternative for effective 3D point cloud understanding via CLIP at low resource cost and in low-data regimes. We conduct thorough experiments on the widely adopted ModelNet10 and ModelNet40 benchmarks and on the challenging ScanObjectNN to demonstrate the effectiveness of PointCLIP. The code is released at https://github.com/ZrrSkywalker/PointCLIP.
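
The zero-shot pipeline described above (rendering-free multi-view depth projection, CLIP encoding of each view, and weighted aggregation of view-wise predictions) can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes OpenAI's `clip` package, substitutes a simplified orthographic projection for the paper's depth-map generation, and treats the prompt template and `view_weights` as illustrative choices.

```python
# Minimal sketch of PointCLIP-style zero-shot classification.
# Assumes OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP).
import torch
import clip


def point_cloud_to_depth_map(points, size=224):
    """Project an (N, 3) point cloud onto the xy-plane as a depth map.

    A simplified orthographic stand-in for the paper's rendering-free
    multi-view projection; nearer points win per pixel.
    """
    pts = points - points.min(dim=0).values
    pts = pts / (pts.max() + 1e-6)  # normalize into the unit cube
    u = (pts[:, 0] * (size - 1)).long()
    v = (pts[:, 1] * (size - 1)).long()
    # Per-pixel max over "closeness" (1 - depth), so nearer points dominate.
    flat = torch.zeros(size * size)
    flat.scatter_reduce_(0, v * size + u, 1.0 - pts[:, 2], reduce="amax")
    depth = flat.view(size, size)
    return depth.expand(3, -1, -1)  # replicate to 3 channels for CLIP


@torch.no_grad()
def zero_shot_classify(points, class_names, views, view_weights, device="cpu"):
    model, _ = clip.load("ViT-B/32", device=device)
    # Illustrative prompt template; the paper explores prompt designs.
    tokens = clip.tokenize(
        [f"point cloud depth map of a {c}." for c in class_names]
    ).to(device)
    text_feat = model.encode_text(tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    logits = 0.0
    for rot, weight in zip(views, view_weights):
        # Rotate the cloud into this view, then project without rendering.
        depth = point_cloud_to_depth_map(points @ rot.T).unsqueeze(0).to(device)
        img_feat = model.encode_image(depth)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        # Weighted sum aggregates the view-wise zero-shot predictions.
        logits = logits + weight * (100.0 * img_feat @ text_feat.T)
    return logits.softmax(dim=-1)


if __name__ == "__main__":
    pts = torch.rand(1024, 3)  # dummy point cloud
    front = torch.eye(3)
    side = torch.tensor([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])
    probs = zero_shot_classify(pts, ["chair", "table", "airplane"],
                               views=[front, side], view_weights=[0.5, 0.5])
    print(probs)
```

The paper's few-shot variant would additionally insert a lightweight inter-view adapter between the per-view image features and the final logits, fine-tuning only the adapter while keeping CLIP frozen; the ensembling result in the abstract amounts to summing these logits with those of a 3D-supervised classifier.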

Results

Task | Dataset | Metric | Value | Model
Shape Representation Of 3D Point Clouds | ScanObjectNN | OBJ_BG Accuracy (%) | 21.34 | PointCLIP
Shape Representation Of 3D Point Clouds | ScanObjectNN | OBJ_ONLY Accuracy (%) | 19.28 | PointCLIP
Shape Representation Of 3D Point Clouds | ScanObjectNN | PB_T50_RS Accuracy (%) | 15.38 | PointCLIP
Shape Representation Of 3D Point Clouds | ModelNet40 | Accuracy (%) | 20.18 | PointCLIP
Shape Representation Of 3D Point Clouds | ModelNet10 | Accuracy (%) | 30.23 | PointCLIP
3D Point Cloud Classification | ScanObjectNN | OBJ_BG Accuracy (%) | 21.34 | PointCLIP
3D Point Cloud Classification | ScanObjectNN | OBJ_ONLY Accuracy (%) | 19.28 | PointCLIP
3D Point Cloud Classification | ScanObjectNN | PB_T50_RS Accuracy (%) | 15.38 | PointCLIP
3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 20.18 | PointCLIP
3D Point Cloud Classification | ModelNet10 | Accuracy (%) | 30.23 | PointCLIP
Training-free 3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 20.2 | PointCLIP
Training-free 3D Point Cloud Classification | ScanObjectNN | Accuracy (%) | 15.4 | PointCLIP
Training-free 3D Part Segmentation | ShapeNet-Part | mIoU | 31 | PointCLIP
3D Open-Vocabulary Instance Segmentation | STPLS3D | AP50 | 2.6 | PointCLIP
3D Point Cloud Reconstruction | ScanObjectNN | OBJ_BG Accuracy (%) | 21.34 | PointCLIP
3D Point Cloud Reconstruction | ScanObjectNN | OBJ_ONLY Accuracy (%) | 19.28 | PointCLIP
3D Point Cloud Reconstruction | ScanObjectNN | PB_T50_RS Accuracy (%) | 15.38 | PointCLIP
3D Point Cloud Reconstruction | ModelNet40 | Accuracy (%) | 20.18 | PointCLIP
3D Point Cloud Reconstruction | ModelNet10 | Accuracy (%) | 30.23 | PointCLIP

Related Papers

RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows (2025-07-16)
Robust-Multi-Task Gradient Boosting (2025-07-15)
Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift (2025-07-12)
The Bayesian Approach to Continual Learning: An Overview (2025-07-11)
Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection (2025-07-10)