TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Beyond First Impressions: Integrating Joint Multi-modal Cu...

Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation

Haowei Wang, Jiji Tang, Jiayi Ji, Xiaoshuai Sun, Rongsheng Zhang, Yiwei Ma, Minda Zhao, Lincheng Li, Zeng Zhao, Tangjie Lv, Rongrong Ji

2023-08-06Zero-shot 3D classificationZero-shot 3D Point Cloud Classification3D ClassificationRepresentation LearningZero-shot 3D Point Cloud Classificationclassification3D Part Segmentation3D Point Cloud Classification
PaperPDFCode(official)

Abstract

In recent years, 3D understanding has turned to 2D vision-language pre-trained models to overcome data scarcity challenges. However, existing methods simply transfer 2D alignment strategies, aligning 3D representations with single-view 2D images and coarse-grained parent category text. These approaches introduce information degradation and insufficient synergy issues, leading to performance loss. Information degradation arises from overlooking the fact that a 3D representation should be equivalent to a series of multi-view images and more fine-grained subcategory text. Insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space, rather than independently aligning with each modality. In this paper, we propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image. Specifically, a novel Structured Multimodal Organizer (SMO) is proposed to address the information degradation issue, which introduces contiguous multi-view images and hierarchical text to enrich the representation of vision and language modalities. A Joint Multi-modal Alignment (JMA) is designed to tackle the insufficient synergy problem, which models the joint modality by incorporating language knowledge into the visual modality. Extensive experiments on ModelNet40 and ScanObjectNN demonstrate the effectiveness of our proposed method, JM3D, which achieves state-of-the-art performance in zero-shot 3D classification. JM3D outperforms ULIP by approximately 4.3% on PointMLP and achieves an improvement of up to 6.5% accuracy on PointNet++ in top-1 accuracy for zero-shot 3D classification on ModelNet40. The source code and trained models for all our experiments are publicly available at https://github.com/Mr-Neko/JM3D.

Results

TaskDatasetMetricValueModel
Semantic SegmentationShapeNet-PartClass Average IoU82.1PointNet++ (ssg) + JM3D
Shape Representation Of 3D Point CloudsScanObjectNNMean Accuracy88.7PointMLP∗ + JM3D
Shape Representation Of 3D Point CloudsScanObjectNNOverall Accuracy89.5PointMLP∗ + JM3D
3D Point Cloud ClassificationScanObjectNNMean Accuracy88.7PointMLP∗ + JM3D
3D Point Cloud ClassificationScanObjectNNOverall Accuracy89.5PointMLP∗ + JM3D
10-shot image generationShapeNet-PartClass Average IoU82.1PointNet++ (ssg) + JM3D
3D Point Cloud ReconstructionScanObjectNNMean Accuracy88.7PointMLP∗ + JM3D
3D Point Cloud ReconstructionScanObjectNNOverall Accuracy89.5PointMLP∗ + JM3D

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper2025-07-20Spectral Bellman Method: Unifying Representation and Exploration in RL2025-07-17Boosting Team Modeling through Tempo-Relational Representation Learning2025-07-17Similarity-Guided Diffusion for Contrastive Sequential Recommendation2025-07-16Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization?2025-07-16Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos2025-07-16A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction2025-07-15Dual Dimensions Geometric Representation Learning Based Document Dewarping2025-07-11