Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

3D-LLM: Injecting the 3D World into Large Language Models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan

Published 2023-07-24 · NeurIPS 2023
Tasks: 3D Object Captioning, Question Answering, 3D Question Answering (3D-QA), Generative 3D Object Classification, Dense Captioning

Abstract

Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses the state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs. Project page: https://vis-www.cs.umass.edu/3dllm/.
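
The training recipe in the abstract (lift features from rendered multi-view images onto the point cloud, add a 3D localization signal, then feed the result to a 2D VLM backbone) can be made concrete with a short sketch. The following is a minimal sketch only: the mean-pooled multi-view aggregation, the MLP position embedding, and all shapes and names are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the 3D-LLM feature pipeline described in the abstract:
# lift per-view 2D features onto 3D points by averaging the features of the
# pixels each point projects to, append a positional embedding standing in
# for the paper's 3D localization mechanism, and hand the result to a 2D
# VLM backbone. Aggregation scheme, shapes, and names are assumptions.
import torch
import torch.nn as nn

def lift_multiview_features(
    feats_2d: torch.Tensor,      # (V, C, H, W) per-view 2D backbone features
    pixel_coords: torch.Tensor,  # (V, N, 2) pixel each point projects to
    visible: torch.Tensor,       # (V, N) bool, point visible in that view?
) -> torch.Tensor:
    """Average each 3D point's 2D features over the views that see it."""
    V, C, H, W = feats_2d.shape
    N = pixel_coords.shape[1]
    accum = torch.zeros(N, C)
    count = torch.zeros(N, 1)
    for v in range(V):
        xy = pixel_coords[v]                      # (N, 2)
        f = feats_2d[v, :, xy[:, 1], xy[:, 0]].T  # (N, C) sampled features
        mask = visible[v].unsqueeze(1).float()
        accum += f * mask
        count += mask
    return accum / count.clamp(min=1.0)           # (N, C) 3D point features

class LocalizationEmbedding(nn.Module):
    """Concatenate a learned embedding of (x, y, z) to each point feature;
    a stand-in for the paper's 3D localization mechanism."""
    def __init__(self, feat_dim: int, pos_dim: int = 64):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(3, pos_dim), nn.ReLU(),
                                     nn.Linear(pos_dim, pos_dim))

    def forward(self, point_feats: torch.Tensor, xyz: torch.Tensor):
        return torch.cat([point_feats, self.pos_mlp(xyz)], dim=-1)

# Toy example: 4 rendered views, 1024 points, 256-dim 2D features.
V, N, C, H, W = 4, 1024, 256, 32, 32
feats_2d = torch.randn(V, C, H, W)
pixel_coords = torch.randint(0, W, (V, N, 2))
visible = torch.rand(V, N) > 0.3
xyz = torch.rand(N, 3)

point_feats = lift_multiview_features(feats_2d, pixel_coords, visible)
tokens = LocalizationEmbedding(C)(point_feats, xyz)
print(tokens.shape)  # torch.Size([1024, 320])
# `tokens` would then be projected and consumed by a 2D VLM backbone
# (e.g. BLIP-2) in place of image patch features.
```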

Results

Visual Question Answering (VQA) on ScanQA (Test w/ objects)

| Model | BLEU-1 | BLEU-4 | CIDEr | Exact Match | METEOR | ROUGE |
|---|---|---|---|---|---|---|
| 3D-LLM (flamingo) | 32.6 | 8.4 | 65.6 | 23.2 | 13.5 | 34.8 |
| 3D-LLM (BLIP2-flant5) | 38.3 | 11.6 | 69.6 | 19.1 | 14.9 | 35.3 |
| 3D-LLM (BLIP2-opt) | 37.3 | 10.7 | 67.1 | 19.1 | 14.3 | 34.5 |
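
For reference, the BLEU and Exact Match columns above follow the standard definitions; below is a minimal sketch using NLTK (CIDEr, METEOR, and ROUGE-L come from the usual captioning toolkits, e.g. pycocoevalcap, and are omitted here):

```python
# Minimal sketch of the BLEU-1 / BLEU-4 / Exact Match columns in the
# ScanQA table above, using NLTK's BLEU implementation. Tokenization and
# smoothing choices are illustrative assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def scanqa_style_scores(prediction: str, references: list[str]) -> dict:
    pred = prediction.lower().split()
    refs = [r.lower().split() for r in references]
    smooth = SmoothingFunction().method1
    return {
        "BLEU-1": sentence_bleu(refs, pred, weights=(1, 0, 0, 0),
                                smoothing_function=smooth),
        "BLEU-4": sentence_bleu(refs, pred, weights=(0.25,) * 4,
                                smoothing_function=smooth),
        "Exact Match": float(prediction.lower().strip()
                             in [r.lower().strip() for r in references]),
    }

print(scanqa_style_scores("a brown wooden chair",
                          ["a brown wooden chair", "the wooden chair"]))
```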
Generative 3D Object Classification on Objaverse (the same result is cross-listed under the task aliases 3D, 3D Classification, 3D Object Classification, 3D Point Cloud Classification, 3D Point Cloud Reconstruction, and Shape Representation of 3D Point Clouds)

| Model | Objaverse (Average) | Objaverse (C) | Objaverse (I) |
|---|---|---|---|
| 3D-LLM | 45.25 | 41.5 | 49 |
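
Generative classification is scored on the model's free-form text rather than a classifier head. Below is a minimal, hypothetical scorer assuming simple substring matching; the benchmark's actual matching rule may differ (e.g. a GPT-based judge):

```python
# Hypothetical scorer for generative object classification: the model emits
# free-form text and a prediction counts as correct when the ground-truth
# category name appears in it. The real Objaverse benchmark may use a
# different matching rule; substring matching is an assumption.
def generative_classification_accuracy(outputs, labels):
    hits = sum(label.lower() in out.lower()
               for out, label in zip(outputs, labels))
    return 100.0 * hits / len(labels)

outputs = ["This is a wooden chair with four legs.", "A small toy airplane."]
labels = ["chair", "car"]
print(generative_classification_accuracy(outputs, labels))  # 50.0
```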
3D Object Captioning on Objaverse

| Model | Sentence-BERT | SimCSE | Correctness | Hallucination | Precision | GPT-4 |
|---|---|---|---|---|---|---|
| 3D-LLM | 44.48 | 43.68 | 1.77 | 1.16 | 60.39 | 33.42 |
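
The Sentence-BERT and SimCSE columns are embedding-similarity scores between generated and reference captions. A minimal sketch of that recipe using the sentence-transformers library follows; the encoder checkpoint is an assumption, not necessarily the one used by the benchmark:

```python
# Minimal sketch of an embedding-similarity caption metric: cosine
# similarity between encoder embeddings of the generated and reference
# captions, scaled to 0-100. The checkpoint name is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def sbert_score(generated: str, reference: str) -> float:
    emb = model.encode([generated, reference], convert_to_tensor=True)
    return 100.0 * util.cos_sim(emb[0], emb[1]).item()

print(sbert_score("a small wooden chair", "a little chair made of wood"))
```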

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
Warehouse Spatial Question Answering with LLM Agent (2025-07-14)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)