


ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding

Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese

2022-12-10, CVPR 2023

Tasks: Zero-shot 3D classification, Zero-shot 3D Point Cloud Classification, 3D Architecture, 3D Classification, Zero-Shot Transfer 3D Point Cloud Classification, Classification, 3D Point Cloud Classification, Language Modelling, Training-free 3D Point Cloud Classification

Abstract

The recognition capabilities of current state-of-the-art 3D models are limited by datasets with a small number of annotated data and a pre-defined set of categories. In its 2D counterpart, recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language. Inspired by this, leveraging multimodal information for 3D modality could be promising to improve 3D understanding under the restricted data regime, but this line of research is not well studied. Therefore, we introduce ULIP to learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities. To overcome the shortage of training triplets, ULIP leverages a pre-trained vision-language model that has already learned a common visual and textual space by training with massive image-text pairs. Then, ULIP learns a 3D representation space aligned with the common image-text space, using a small number of automatically synthesized triplets. ULIP is agnostic to 3D backbone networks and can easily be integrated into any 3D architecture. Experiments show that ULIP effectively improves the performance of multiple recent 3D backbones by simply pre-training them on ShapeNet55 using our framework, achieving state-of-the-art performance in both standard 3D classification and zero-shot 3D classification on ModelNet40 and ScanObjectNN. ULIP also improves the performance of PointMLP by around 3% in 3D classification on ScanObjectNN, and outperforms PointCLIP by 28.8% on top-1 accuracy for zero-shot 3D classification on ModelNet40. Our code and pre-trained models are released at https://github.com/salesforce/ULIP.
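Once point clouds and text share one embedding space, zero-shot classification can be read as a nearest-neighbor lookup: embed each candidate category name as text, embed the point cloud, and pick the category whose text embedding is most similar. The snippet below is a minimal sketch of that idea only. The use of cosine similarity, the function names, and the 3-dimensional toy vectors are assumptions made for illustration; they are not taken from the ULIP code base, where the embeddings would come from the pre-trained 3D and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_classify(point_cloud_embedding, text_embeddings):
    """Return (best_label, scores) for one point-cloud embedding.

    text_embeddings maps each category name to its text embedding.
    Both kinds of embedding are assumed to live in the same space.
    """
    scores = {
        label: cosine_similarity(point_cloud_embedding, emb)
        for label, emb in text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical toy embeddings; real ones would come from trained encoders.
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}
point_cloud_embedding = [0.9, 0.3, 0.1]

label, scores = zero_shot_classify(point_cloud_embedding, text_embeddings)
print(label)
```

No 3D labels are used at prediction time in this sketch; the category set is defined entirely by the text prompts, which is why new categories can be added without retraining.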

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Shape Representation Of 3D Point Clouds | ScanObjectNN | Mean Accuracy | 88.6 | ULIP + PointNeXt |
| Shape Representation Of 3D Point Clouds | ScanObjectNN | Overall Accuracy | 89.7 | ULIP + PointNeXt |
| Shape Representation Of 3D Point Clouds | ScanObjectNN | Mean Accuracy | 88.5 | ULIP + PointMLP |
| Shape Representation Of 3D Point Clouds | ScanObjectNN | Overall Accuracy | 89.4 | ULIP + PointMLP |
| Shape Representation Of 3D Point Clouds | ScanObjectNN | Overall Accuracy | 86.4 | ULIP + PointBERT |
| Shape Representation Of 3D Point Clouds | ModelNet40 | Mean Accuracy | 92.4 | ULIP + PointMLP |
| Shape Representation Of 3D Point Clouds | ModelNet40 | Overall Accuracy | 94.7 | ULIP + PointMLP |
| Shape Representation Of 3D Point Clouds | ModelNet40 | Overall Accuracy | 94.1 | ULIP + PointBERT |
| Shape Representation Of 3D Point Clouds | ModelNet40 | Mean Accuracy | 91.2 | ULIP + PointNet++(ssg) |
| Shape Representation Of 3D Point Clouds | ModelNet40 | Overall Accuracy | 93.4 | ULIP + PointNet++(ssg) |
| Shape Representation Of 3D Point Clouds | ModelNet40 | Accuracy (%) | 61.5 | ULIP + PointMLP |
| Shape Representation Of 3D Point Clouds | ModelNet40 | Accuracy (%) | 60.4 | ULIP + PointBERT |
| 3D Point Cloud Classification | ScanObjectNN | Mean Accuracy | 88.6 | ULIP + PointNeXt |
| 3D Point Cloud Classification | ScanObjectNN | Overall Accuracy | 89.7 | ULIP + PointNeXt |
| 3D Point Cloud Classification | ScanObjectNN | Mean Accuracy | 88.5 | ULIP + PointMLP |
| 3D Point Cloud Classification | ScanObjectNN | Overall Accuracy | 89.4 | ULIP + PointMLP |
| 3D Point Cloud Classification | ScanObjectNN | Overall Accuracy | 86.4 | ULIP + PointBERT |
| 3D Point Cloud Classification | ModelNet40 | Mean Accuracy | 92.4 | ULIP + PointMLP |
| 3D Point Cloud Classification | ModelNet40 | Overall Accuracy | 94.7 | ULIP + PointMLP |
| 3D Point Cloud Classification | ModelNet40 | Overall Accuracy | 94.1 | ULIP + PointBERT |
| 3D Point Cloud Classification | ModelNet40 | Mean Accuracy | 91.2 | ULIP + PointNet++(ssg) |
| 3D Point Cloud Classification | ModelNet40 | Overall Accuracy | 93.4 | ULIP + PointNet++(ssg) |
| 3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 61.5 | ULIP + PointMLP |
| 3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 60.4 | ULIP + PointBERT |
| Training-free 3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 60.4 | ULIP |
| 3D Point Cloud Reconstruction | ScanObjectNN | Mean Accuracy | 88.6 | ULIP + PointNeXt |
| 3D Point Cloud Reconstruction | ScanObjectNN | Overall Accuracy | 89.7 | ULIP + PointNeXt |
| 3D Point Cloud Reconstruction | ScanObjectNN | Mean Accuracy | 88.5 | ULIP + PointMLP |
| 3D Point Cloud Reconstruction | ScanObjectNN | Overall Accuracy | 89.4 | ULIP + PointMLP |
| 3D Point Cloud Reconstruction | ScanObjectNN | Overall Accuracy | 86.4 | ULIP + PointBERT |
| 3D Point Cloud Reconstruction | ModelNet40 | Mean Accuracy | 92.4 | ULIP + PointMLP |
| 3D Point Cloud Reconstruction | ModelNet40 | Overall Accuracy | 94.7 | ULIP + PointMLP |
| 3D Point Cloud Reconstruction | ModelNet40 | Overall Accuracy | 94.1 | ULIP + PointBERT |
| 3D Point Cloud Reconstruction | ModelNet40 | Mean Accuracy | 91.2 | ULIP + PointNet++(ssg) |
| 3D Point Cloud Reconstruction | ModelNet40 | Overall Accuracy | 93.4 | ULIP + PointNet++(ssg) |
| 3D Point Cloud Reconstruction | ModelNet40 | Accuracy (%) | 61.5 | ULIP + PointMLP |
| 3D Point Cloud Reconstruction | ModelNet40 | Accuracy (%) | 60.4 | ULIP + PointBERT |
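The results list both "Overall Accuracy" and "Mean Accuracy". Under the usual convention on point cloud classification benchmarks, overall accuracy is the fraction of all test samples classified correctly, while mean accuracy averages the per-class accuracies so that every class counts equally regardless of its size. The sketch below illustrates that convention on made-up labels; it assumes this page follows the same definitions, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples whose prediction matches the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Unweighted average of per-class accuracies."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy, imbalanced test set: 8 chairs and 2 lamps (made-up data).
y_true = ["chair"] * 8 + ["lamp"] * 2
# All chairs are correct; one of the two lamps is mistaken for a chair.
y_pred = ["chair"] * 8 + ["chair", "lamp"]

oa = overall_accuracy(y_true, y_pred)
macc = mean_class_accuracy(y_true, y_pred)
print(f"overall accuracy: {oa:.2f}")     # 9 of 10 samples correct -> 0.90
print(f"mean class accuracy: {macc:.2f}")  # (1.0 + 0.5) / 2 -> 0.75
```

The two metrics diverge whenever classes are imbalanced and errors concentrate in the smaller classes, which is why benchmark tables often report both.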

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
Safeguarding Federated Learning-based Road Condition Classification (2025-07-16)