Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

3D-LLM: Injecting the 3D World into Large Language Models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan

Published 2023-07-24 · NeurIPS 2023
Tasks: 3D Object Captioning, Question Answering, 3D Question Answering (3D-QA), Generative 3D Object Classification, Dense Captioning

Abstract

Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses the state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs. Project page: https://vis-www.cs.umass.edu/3dllm/.
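
The training recipe in the abstract (lift features from rendered multi-view images onto the point cloud, add a 3D localization signal, then feed the result to a 2D VLM backbone) can be made concrete with a short sketch. The following is a minimal sketch only: the mean-pooled multi-view aggregation, the MLP position embedding, and all shapes and names are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the 3D-LLM feature pipeline described in the abstract:
# lift per-view 2D features onto 3D points by averaging the features of the
# pixels each point projects to, append a positional embedding standing in
# for the paper's 3D localization mechanism, and hand the result to a 2D
# VLM backbone. Aggregation scheme, shapes, and names are assumptions.
import torch
import torch.nn as nn

def lift_multiview_features(
    feats_2d: torch.Tensor,      # (V, C, H, W) per-view 2D backbone features
    pixel_coords: torch.Tensor,  # (V, N, 2) pixel each point projects to
    visible: torch.Tensor,       # (V, N) bool, point visible in that view?
) -> torch.Tensor:
    """Average each 3D point's 2D features over the views that see it."""
    V, C, H, W = feats_2d.shape
    N = pixel_coords.shape[1]
    accum = torch.zeros(N, C)
    count = torch.zeros(N, 1)
    for v in range(V):
        xy = pixel_coords[v]                      # (N, 2)
        f = feats_2d[v, :, xy[:, 1], xy[:, 0]].T  # (N, C) sampled features
        mask = visible[v].unsqueeze(1).float()
        accum += f * mask
        count += mask
    return accum / count.clamp(min=1.0)           # (N, C) 3D point features

class LocalizationEmbedding(nn.Module):
    """Concatenate a learned embedding of (x, y, z) to each point feature;
    a stand-in for the paper's 3D localization mechanism."""
    def __init__(self, feat_dim: int, pos_dim: int = 64):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(3, pos_dim), nn.ReLU(),
                                     nn.Linear(pos_dim, pos_dim))

    def forward(self, point_feats: torch.Tensor, xyz: torch.Tensor):
        return torch.cat([point_feats, self.pos_mlp(xyz)], dim=-1)

# Toy example: 4 rendered views, 1024 points, 256-dim 2D features.
V, N, C, H, W = 4, 1024, 256, 32, 32
feats_2d = torch.randn(V, C, H, W)
pixel_coords = torch.randint(0, W, (V, N, 2))
visible = torch.rand(V, N) > 0.3
xyz = torch.rand(N, 3)

point_feats = lift_multiview_features(feats_2d, pixel_coords, visible)
tokens = LocalizationEmbedding(C)(point_feats, xyz)
print(tokens.shape)  # torch.Size([1024, 320])
# `tokens` would then be projected and consumed by a 2D VLM backbone
# (e.g. BLIP-2) in place of image patch features.
```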

Results

Visual Question Answering (VQA) on ScanQA (Test w/ objects)

| Model | BLEU-1 | BLEU-4 | CIDEr | Exact Match | METEOR | ROUGE |
|---|---|---|---|---|---|---|
| 3D-LLM (flamingo) | 32.6 | 8.4 | 65.6 | 23.2 | 13.5 | 34.8 |
| 3D-LLM (BLIP2-flant5) | 38.3 | 11.6 | 69.6 | 19.1 | 14.9 | 35.3 |
| 3D-LLM (BLIP2-opt) | 37.3 | 10.7 | 67.1 | 19.1 | 14.3 | 34.5 |
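
For reference, the BLEU and Exact Match columns above follow the standard definitions; below is a minimal sketch using NLTK (CIDEr, METEOR, and ROUGE-L come from the usual captioning toolkits, e.g. pycocoevalcap, and are omitted here):

```python
# Minimal sketch of the BLEU-1 / BLEU-4 / Exact Match columns in the
# ScanQA table above, using NLTK's BLEU implementation. Tokenization and
# smoothing choices are illustrative assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def scanqa_style_scores(prediction: str, references: list[str]) -> dict:
    pred = prediction.lower().split()
    refs = [r.lower().split() for r in references]
    smooth = SmoothingFunction().method1
    return {
        "BLEU-1": sentence_bleu(refs, pred, weights=(1, 0, 0, 0),
                                smoothing_function=smooth),
        "BLEU-4": sentence_bleu(refs, pred, weights=(0.25,) * 4,
                                smoothing_function=smooth),
        "Exact Match": float(prediction.lower().strip()
                             in [r.lower().strip() for r in references]),
    }

print(scanqa_style_scores("a brown wooden chair",
                          ["a brown wooden chair", "the wooden chair"]))
```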
Generative 3D Object Classification on Objaverse (the same result is cross-listed under the task aliases 3D, 3D Classification, 3D Object Classification, 3D Point Cloud Classification, 3D Point Cloud Reconstruction, and Shape Representation of 3D Point Clouds)

| Model | Objaverse (Average) | Objaverse (C) | Objaverse (I) |
|---|---|---|---|
| 3D-LLM | 45.25 | 41.5 | 49 |
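
Generative classification is scored on the model's free-form text rather than a classifier head. Below is a minimal, hypothetical scorer assuming simple substring matching; the benchmark's actual matching rule may differ (e.g. a GPT-based judge):

```python
# Hypothetical scorer for generative object classification: the model emits
# free-form text and a prediction counts as correct when the ground-truth
# category name appears in it. The real Objaverse benchmark may use a
# different matching rule; substring matching is an assumption.
def generative_classification_accuracy(outputs, labels):
    hits = sum(label.lower() in out.lower()
               for out, label in zip(outputs, labels))
    return 100.0 * hits / len(labels)

outputs = ["This is a wooden chair with four legs.", "A small toy airplane."]
labels = ["chair", "car"]
print(generative_classification_accuracy(outputs, labels))  # 50.0
```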
3D Object Captioning on Objaverse

| Model | Sentence-BERT | SimCSE | Correctness | Hallucination | Precision | GPT-4 |
|---|---|---|---|---|---|---|
| 3D-LLM | 44.48 | 43.68 | 1.77 | 1.16 | 60.39 | 33.42 |
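
The Sentence-BERT and SimCSE columns are embedding-similarity scores between generated and reference captions. A minimal sketch of that recipe using the sentence-transformers library follows; the encoder checkpoint is an assumption, not necessarily the one used by the benchmark:

```python
# Minimal sketch of an embedding-similarity caption metric: cosine
# similarity between encoder embeddings of the generated and reference
# captions, scaled to 0-100. The checkpoint name is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def sbert_score(generated: str, reference: str) -> float:
    emb = model.encode([generated, reference], convert_to_tensor=True)
    return 100.0 * util.cos_sim(emb[0], emb[1]).item()

print(sbert_score("a small wooden chair", "a little chair made of wood"))
```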

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
Warehouse Spatial Question Answering with LLM Agent (2025-07-14)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)