Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training

Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson W. H. Lau, Wanli Ouyang, Wangmeng Zuo

Published: 2022-10-03 · ICCV 2023
Tasks: Zero-shot 3D classification, Few-Shot Learning, Zero-shot 3D Point Cloud Classification, Zero-Shot Transfer 3D Point Cloud Classification, Contrastive Learning, 3D Point Cloud Classification, Training-free 3D Point Cloud Classification, Point Cloud Classification
Links: Paper · PDF · Code (official)

Abstract

Pre-training across 3D vision and language remains underdeveloped because of limited training data. Recent works attempt to transfer vision-language pre-training models to 3D vision. PointCLIP converts point cloud data to multi-view depth maps, adopting CLIP for shape classification. However, its performance is restricted by the domain gap between rendered depth maps and natural images, as well as by the diversity of depth distributions. To address this issue, we propose CLIP2Point, an image-depth pre-training method based on contrastive learning that transfers CLIP to the 3D domain and adapts it to point cloud classification. We introduce a new depth rendering setting that produces a better visual effect, and then render 52,460 pairs of images and depth maps from ShapeNet for pre-training. The pre-training scheme of CLIP2Point combines cross-modality learning, which encourages the depth features to capture expressive visual and textual features, with intra-modality learning, which enhances the invariance of depth aggregation. Additionally, we propose a novel Dual-Path Adapter (DPA) module, i.e., a dual-path structure with simplified adapters for few-shot learning. The dual-path structure allows the joint use of CLIP and CLIP2Point, and the simplified adapter fits few-shot tasks well without post-search. Experimental results show that CLIP2Point is effective in transferring CLIP knowledge to 3D vision. Our CLIP2Point outperforms PointCLIP and other self-supervised 3D networks, achieving state-of-the-art results on zero-shot and few-shot classification.
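
The combined pre-training objective the abstract describes (a cross-modality image-depth term plus an intra-modality term across depth views) can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function names, the averaging of the two depth views, the temperature, and the weighting `lam` are all assumptions for illustration.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    a, b: (N, D) arrays; matched rows are treated as positive pairs,
    all other rows in the batch as negatives."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (N, N) cosine-similarity logits
    labels = np.arange(len(a))

    def xent(l):
        # cross-entropy of each row against its diagonal (positive) entry
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average both directions (image -> depth and depth -> image)
    return 0.5 * (xent(logits) + xent(logits.T))

def clip2point_pretrain_loss(img_feat, depth_feat_v1, depth_feat_v2, lam=0.5):
    """Sketch of the combined objective: a cross-modality term aligning
    rendered images with aggregated depth features, plus an intra-modality
    term making the depth embedding invariant across the two rendered views.
    The averaging and the weight `lam` are hypothetical choices."""
    cross = info_nce(img_feat, 0.5 * (depth_feat_v1 + depth_feat_v2))
    intra = info_nce(depth_feat_v1, depth_feat_v2)
    return cross + lam * intra
```

As a sanity check, the loss should be lower when image and depth embeddings of the same sample are aligned than when the positive pairs are scrambled.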

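The Dual-Path Adapter can likewise be sketched as two adapted encoder paths whose cosine-similarity logits against text embeddings are fused. Heavily hedged: this NumPy toy assumes each "simplified adapter" is one linear layer with a residual blend and that the paths are fused by a scalar weight `alpha`; the paper's actual adapter design and fusion may differ.

```python
import numpy as np

def _unit(x):
    # L2-normalize along the last axis
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def simplified_adapter(x, W, ratio=0.2):
    """One linear layer + ReLU, blended with the input via a residual
    ratio (the ratio and layer shape are illustrative assumptions)."""
    x = _unit(x)
    adapted = _unit(np.maximum(x @ W, 1e-9))   # small floor avoids a zero norm
    return _unit(ratio * adapted + (1.0 - ratio) * x)

def dual_path_classify(feat_clip, feat_c2p, W_clip, W_c2p, text_feat, alpha=0.5):
    """Fuse cosine-similarity logits from the frozen-CLIP path and the
    CLIP2Point path, then pick the best-matching text class."""
    t = _unit(text_feat)
    logits = (alpha * (simplified_adapter(feat_clip, W_clip) @ t.T)
              + (1.0 - alpha) * (simplified_adapter(feat_c2p, W_c2p) @ t.T))
    return logits.argmax(axis=-1)
```

With identity adapters and one-hot text prototypes, a feature dominated by one class dimension should be assigned to that class.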
Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Shape Representation of 3D Point Clouds | ScanObjectNN | OBJ_BG Accuracy (%) | 35.46 | CLIP2Point |
| Shape Representation of 3D Point Clouds | ScanObjectNN | OBJ_ONLY Accuracy (%) | 30.46 | CLIP2Point |
| Shape Representation of 3D Point Clouds | ScanObjectNN | PB_T50_RS Accuracy (%) | 23.32 | CLIP2Point |
| Shape Representation of 3D Point Clouds | ModelNet40 | Accuracy (%) | 49.38 | CLIP2Point |
| Shape Representation of 3D Point Clouds | ModelNet10 | Accuracy (%) | 66.63 | CLIP2Point |
| 3D Point Cloud Classification | ScanObjectNN | OBJ_BG Accuracy (%) | 35.46 | CLIP2Point |
| 3D Point Cloud Classification | ScanObjectNN | OBJ_ONLY Accuracy (%) | 30.46 | CLIP2Point |
| 3D Point Cloud Classification | ScanObjectNN | PB_T50_RS Accuracy (%) | 23.32 | CLIP2Point |
| 3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 49.38 | CLIP2Point |
| 3D Point Cloud Classification | ModelNet10 | Accuracy (%) | 66.63 | CLIP2Point |
| Training-free 3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 49.4 | CLIP2Point |
| Training-free 3D Point Cloud Classification | ScanObjectNN | Accuracy (%) | 23.2 | CLIP2Point |
| 3D Point Cloud Reconstruction | ScanObjectNN | OBJ_BG Accuracy (%) | 35.46 | CLIP2Point |
| 3D Point Cloud Reconstruction | ScanObjectNN | OBJ_ONLY Accuracy (%) | 30.46 | CLIP2Point |
| 3D Point Cloud Reconstruction | ScanObjectNN | PB_T50_RS Accuracy (%) | 23.32 | CLIP2Point |
| 3D Point Cloud Reconstruction | ModelNet40 | Accuracy (%) | 49.38 | CLIP2Point |
| 3D Point Cloud Reconstruction | ModelNet10 | Accuracy (%) | 66.63 | CLIP2Point |

Related Papers

GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
LLM-Driven Dual-Level Multi-Interest Modeling for Recommendation (2025-07-15)
Latent Space Consistency for Sparse-View CT Reconstruction (2025-07-15)