Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors

Yuan Tang, Xu Han, Xianzhi Li, Qiao Yu, Yixue Hao, Long Hu, Min Chen

2024-05-02

Tasks: 3D Object Captioning, Parameter-Efficient Fine-Tuning, Generative 3D Object Classification, 3D Object Classification

Links: Paper | PDF | Code (official)

Abstract

Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLMs incurs expensive training costs, typically hundreds of A100 GPU-hours, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on a single RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which leverages the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize the parameter-efficient fine-tuning methods LoRA and norm fine-tuning, resulting in only 47.8M learnable parameters, up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks at significantly lower training cost. Notably, MiniGPT-3D gains an 8.12-point increase in GPT-4 evaluation score on the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800 GPUs. We are the first to explore efficient 3D-LLMs, offering new insights to the community. Code and weights are available at https://github.com/TangYuan96/MiniGPT-3D.
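To make the two efficiency ideas in the abstract concrete, the sketch below illustrates (a) a toy mixture-of-query-experts aggregation, where several sets of query tokens are blended by softmax gating weights, and (b) the LoRA parameter-count arithmetic behind the small trainable footprint. This is a minimal illustration, not the paper's implementation: all dimensions, the fixed gate logits, and the function names are assumptions, and the real module learns its gate from input features.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of gate logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mix_query_experts(expert_queries, gate_logits):
    """Blend several sets of query tokens into one set by a
    softmax-weighted sum (toy stand-in for a mixture of query
    experts; gate logits are given here, learned in practice)."""
    weights = softmax(gate_logits)
    n_tokens = len(expert_queries[0])
    dim = len(expert_queries[0][0])
    mixed = [[0.0] * dim for _ in range(n_tokens)]
    for w, queries in zip(weights, expert_queries):
        for t in range(n_tokens):
            for d in range(dim):
                mixed[t][d] += w * queries[t][d]
    return mixed

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one frozen d_in x d_out
    weight: a down-projection (d_in x rank) plus an up-projection
    (rank x d_out)."""
    return rank * (d_in + d_out)

if __name__ == "__main__":
    # Three experts, each holding 4 query tokens of dimension 8 (made-up sizes).
    experts = [[[float(t + d) for d in range(8)] for t in range(4)]
               for _ in range(3)]
    mixed = mix_query_experts(experts, gate_logits=[0.2, 1.5, -0.3])
    print(len(mixed), len(mixed[0]))  # 4 8

    # LoRA on one hypothetical 4096 x 4096 projection with rank 16:
    print(lora_param_count(4096, 4096, 16))  # 131072
```

The arithmetic shows why adapter-style tuning stays cheap: a rank-16 LoRA on a 4096x4096 matrix trains ~131K parameters instead of ~16.8M, and summing such terms over the adapted layers yields totals on the order of the 47.8M the paper reports.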

Results

Task | Dataset | Metric | Value | Model
3D | Objaverse | Objaverse (Average) | 60.25 | MiniGPT-3D
3D | Objaverse | Objaverse (C) | 60.5 | MiniGPT-3D
3D | Objaverse | Objaverse (I) | 60 | MiniGPT-3D
3D | ModelNet40 | ModelNet40 (Average) | 60.86 | MiniGPT-3D
3D | ModelNet40 | ModelNet40 (C) | 59.97 | MiniGPT-3D
3D | ModelNet40 | ModelNet40 (I) | 61.75 | MiniGPT-3D
Shape Representation Of 3D Point Clouds | Objaverse | Objaverse (Average) | 60.25 | MiniGPT-3D
Shape Representation Of 3D Point Clouds | Objaverse | Objaverse (C) | 60.5 | MiniGPT-3D
Shape Representation Of 3D Point Clouds | Objaverse | Objaverse (I) | 60 | MiniGPT-3D
Shape Representation Of 3D Point Clouds | ModelNet40 | ModelNet40 (Average) | 60.86 | MiniGPT-3D
Shape Representation Of 3D Point Clouds | ModelNet40 | ModelNet40 (C) | 59.97 | MiniGPT-3D
Shape Representation Of 3D Point Clouds | ModelNet40 | ModelNet40 (I) | 61.75 | MiniGPT-3D
3D Object Classification | Objaverse | Objaverse (Average) | 60.25 | MiniGPT-3D
3D Object Classification | Objaverse | Objaverse (C) | 60.5 | MiniGPT-3D
3D Object Classification | Objaverse | Objaverse (I) | 60 | MiniGPT-3D
3D Object Classification | ModelNet40 | ModelNet40 (Average) | 60.86 | MiniGPT-3D
3D Object Classification | ModelNet40 | ModelNet40 (C) | 59.97 | MiniGPT-3D
3D Object Classification | ModelNet40 | ModelNet40 (I) | 61.75 | MiniGPT-3D
3D Point Cloud Classification | Objaverse | Objaverse (Average) | 60.25 | MiniGPT-3D
3D Point Cloud Classification | Objaverse | Objaverse (C) | 60.5 | MiniGPT-3D
3D Point Cloud Classification | Objaverse | Objaverse (I) | 60 | MiniGPT-3D
3D Point Cloud Classification | ModelNet40 | ModelNet40 (Average) | 60.86 | MiniGPT-3D
3D Point Cloud Classification | ModelNet40 | ModelNet40 (C) | 59.97 | MiniGPT-3D
3D Point Cloud Classification | ModelNet40 | ModelNet40 (I) | 61.75 | MiniGPT-3D
3D Classification | Objaverse | Objaverse (Average) | 60.25 | MiniGPT-3D
3D Classification | Objaverse | Objaverse (C) | 60.5 | MiniGPT-3D
3D Classification | Objaverse | Objaverse (I) | 60 | MiniGPT-3D
3D Classification | ModelNet40 | ModelNet40 (Average) | 60.86 | MiniGPT-3D
3D Classification | ModelNet40 | ModelNet40 (C) | 59.97 | MiniGPT-3D
3D Classification | ModelNet40 | ModelNet40 (I) | 61.75 | MiniGPT-3D
3D Point Cloud Reconstruction | Objaverse | Objaverse (Average) | 60.25 | MiniGPT-3D
3D Point Cloud Reconstruction | Objaverse | Objaverse (C) | 60.5 | MiniGPT-3D
3D Point Cloud Reconstruction | Objaverse | Objaverse (I) | 60 | MiniGPT-3D
3D Point Cloud Reconstruction | ModelNet40 | ModelNet40 (Average) | 60.86 | MiniGPT-3D
3D Point Cloud Reconstruction | ModelNet40 | ModelNet40 (C) | 59.97 | MiniGPT-3D
3D Point Cloud Reconstruction | ModelNet40 | ModelNet40 (I) | 61.75 | MiniGPT-3D
Generative 3D Object Classification | Objaverse | Objaverse (Average) | 60.25 | MiniGPT-3D
Generative 3D Object Classification | Objaverse | Objaverse (C) | 60.5 | MiniGPT-3D
Generative 3D Object Classification | Objaverse | Objaverse (I) | 60 | MiniGPT-3D
Generative 3D Object Classification | ModelNet40 | ModelNet40 (Average) | 60.86 | MiniGPT-3D
Generative 3D Object Classification | ModelNet40 | ModelNet40 (C) | 59.97 | MiniGPT-3D
Generative 3D Object Classification | ModelNet40 | ModelNet40 (I) | 61.75 | MiniGPT-3D
3D Object Captioning | Objaverse | Sentence-BERT | 49.54 | MiniGPT-3D
3D Object Captioning | Objaverse | Correctness | 3.5 | MiniGPT-3D
3D Object Captioning | Objaverse | GPT-4 | 57.06 | MiniGPT-3D
3D Object Captioning | Objaverse | Hallucination | 0.71 | MiniGPT-3D
3D Object Captioning | Objaverse | Precision | 83.14 | MiniGPT-3D
3D Object Captioning | Objaverse | SimCSE | 51.39 | MiniGPT-3D

Related Papers

Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization (2025-07-06)
Exploring Adapter Design Tradeoffs for Low Resource Music Generation (2025-06-26)
WordCon: Word-level Typography Control in Scene Text Rendering (2025-06-26)
Optimising Language Models for Downstream Tasks: A Post-Training Perspective (2025-06-26)
Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models (2025-06-26)
Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models (2025-06-26)
ARD-LoRA: Dynamic Rank Allocation for Parameter-Efficient Fine-Tuning of Foundation Models with Heterogeneous Adaptation Needs (2025-06-23)