Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

OpenDlign: Open-World Point Cloud Understanding with Depth-Aligned Images

Ye Mao, Junpeng Jing, Krystian Mikolajczyk

2024-04-25
Tasks: Zero-shot 3D classification, Zero-shot 3D Point Cloud Classification, Representation Learning, Transfer Learning, Zero-Shot Transfer 3D Point Cloud Classification, Zero-Shot Learning

Abstract

Recent open-world 3D representation learning methods using Vision-Language Models (VLMs) to align 3D point clouds with image-text information have shown superior 3D zero-shot performance. However, CAD-rendered images for this alignment often lack realism and texture variation, compromising alignment robustness. Moreover, the volume discrepancy between 3D and 2D pretraining datasets highlights the need for effective strategies to transfer the representational abilities of VLMs to 3D learning. In this paper, we present OpenDlign, a novel open-world 3D model using depth-aligned images generated from a diffusion model for robust multimodal alignment. These images exhibit greater texture diversity than CAD renderings due to the stochastic nature of the diffusion model. By refining the depth map projection pipeline and designing depth-specific prompts, OpenDlign leverages the rich knowledge in a pre-trained VLM for 3D representation learning with streamlined fine-tuning. Our experiments show that OpenDlign achieves high zero-shot and few-shot performance on diverse 3D tasks, despite only fine-tuning 6 million parameters on a limited ShapeNet dataset. In zero-shot classification, OpenDlign surpasses previous models by 8.0% on ModelNet40 and 16.4% on OmniObject3D. Additionally, using depth-aligned images for multimodal alignment consistently enhances the performance of other state-of-the-art models.
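
The zero-shot classification step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each depth view of a point cloud has already been encoded into a feature vector by an image encoder, that each class prompt has been encoded by a text encoder, and that views are fused by a plain mean of cosine similarities. The function names, the fusion rule, and the toy embeddings are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_classify(view_embeddings, text_embeddings):
    """Pick the class whose prompt embedding best matches the depth views.

    view_embeddings: (num_views, dim) features, one per depth view (assumed given)
    text_embeddings: (num_classes, dim) features, one per class prompt (assumed given)
    Returns (predicted class index, per-class scores).
    """
    views = l2_normalize(np.asarray(view_embeddings, dtype=np.float64))
    texts = l2_normalize(np.asarray(text_embeddings, dtype=np.float64))
    # Cosine similarity of every view against every class prompt.
    sims = views @ texts.T  # shape: (num_views, num_classes)
    # Assumption: views are fused by averaging their similarities.
    scores = sims.mean(axis=0)
    return int(np.argmax(scores)), scores


# Toy data: three classes with orthogonal prompt embeddings, two depth views
# that both point mostly along the second class direction.
text = np.eye(3)
views = np.array([[0.1, 0.9, 0.0],
                  [0.2, 0.7, 0.1]])
pred, scores = zero_shot_classify(views, text)
print(pred, scores)
```

In a real pipeline the placeholder arrays would come from the VLM's image and text encoders; the averaging rule here is only one simple way to combine multiple views.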

Results

Task | Dataset | Metric | Value | Model
Shape Representation Of 3D Point Clouds | ScanObjectNN | OBJ_ONLY Accuracy (%) | 60.5 | TAMM-PointBERT (+dlign)
Shape Representation Of 3D Point Clouds | ScanObjectNN | OBJ_ONLY Accuracy (%) | 59.5 | OpenDlign
Shape Representation Of 3D Point Clouds | ModelNet40 | Accuracy (%) | 86.2 | TAMM-PointBERT (+dlign)
Shape Representation Of 3D Point Clouds | ModelNet40 | Accuracy (%) | 85.4 | OpenShape-PointBERT (+dlign)
Shape Representation Of 3D Point Clouds | ModelNet40 | Accuracy (%) | 85 | OpenShape-SparseConv (+dlign)
Shape Representation Of 3D Point Clouds | ModelNet40 | Accuracy (%) | 82.6 | OpenDlign
3D Point Cloud Classification | ScanObjectNN | OBJ_ONLY Accuracy (%) | 60.5 | TAMM-PointBERT (+dlign)
3D Point Cloud Classification | ScanObjectNN | OBJ_ONLY Accuracy (%) | 59.5 | OpenDlign
3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 86.2 | TAMM-PointBERT (+dlign)
3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 85.4 | OpenShape-PointBERT (+dlign)
3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 85 | OpenShape-SparseConv (+dlign)
3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 82.6 | OpenDlign
3D Point Cloud Reconstruction | ScanObjectNN | OBJ_ONLY Accuracy (%) | 60.5 | TAMM-PointBERT (+dlign)
3D Point Cloud Reconstruction | ScanObjectNN | OBJ_ONLY Accuracy (%) | 59.5 | OpenDlign
3D Point Cloud Reconstruction | ModelNet40 | Accuracy (%) | 86.2 | TAMM-PointBERT (+dlign)
3D Point Cloud Reconstruction | ModelNet40 | Accuracy (%) | 85.4 | OpenShape-PointBERT (+dlign)
3D Point Cloud Reconstruction | ModelNet40 | Accuracy (%) | 85 | OpenShape-SparseConv (+dlign)
3D Point Cloud Reconstruction | ModelNet40 | Accuracy (%) | 82.6 | OpenDlign

Related Papers

2025-07-20  Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper
2025-07-18  RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction
2025-07-17  Spectral Bellman Method: Unifying Representation and Exploration in RL
2025-07-17  Boosting Team Modeling through Tempo-Relational Representation Learning
2025-07-17  Disentangling coincident cell events using deep transfer learning and compressive sensing
2025-07-17  GLAD: Generalizable Tuning for Vision-Language Models
2025-07-16  Similarity-Guided Diffusion for Contrastive Sequential Recommendation
2025-07-16  Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization?