Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding from 3D point clouds and language. ShapeLLM is built on an improved 3D encoder, ReCon++, which extends ReCon with multi-view image distillation for enhanced geometry understanding. Using ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and evaluated on our new human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Project page: https://qizekun.github.io/shapellm/
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | 3D MM-Vet | Overall Accuracy | 53.1 | ShapeLLM-13B |
| Visual Question Answering (VQA) | 3D MM-Vet | Overall Accuracy | 47.4 | ShapeLLM-7B |
| 3D | Objaverse | Objaverse (Average) | 54.5 | ShapeLLM-7B |
| 3D | Objaverse | Objaverse (Average) | 54 | ShapeLLM-13B |
| 3D | ModelNet40 | ModelNet40 (Average) | 53.08 | ShapeLLM-7B |
| 3D | ModelNet40 | ModelNet40 (Average) | 52.96 | ShapeLLM-13B |
| Shape Representation Of 3D Point Clouds | ScanObjectNN | OBJ-BG (OA) | 98.8 | ReCon++ |
| Shape Representation Of 3D Point Clouds | ScanObjectNN | OBJ-ONLY (OA) | 97.59 | ReCon++ |
| Shape Representation Of 3D Point Clouds | ScanObjectNN | Overall Accuracy | 95.25 | ReCon++ |
| Shape Representation Of 3D Point Clouds | ModelNet40 | Overall Accuracy | 95 | ReCon++ |
| Shape Representation Of 3D Point Clouds | Objaverse | Objaverse (Average) | 54.5 | ShapeLLM-7B |
| Shape Representation Of 3D Point Clouds | Objaverse | Objaverse (Average) | 54 | ShapeLLM-13B |
| Shape Representation Of 3D Point Clouds | ModelNet40 | ModelNet40 (Average) | 53.08 | ShapeLLM-7B |
| Shape Representation Of 3D Point Clouds | ModelNet40 | ModelNet40 (Average) | 52.96 | ShapeLLM-13B |
| Shape Representation Of 3D Point Clouds | ModelNet40 10-way (20-shot) | Overall Accuracy | 96.5 | ReCon++ |
| Shape Representation Of 3D Point Clouds | ModelNet40 10-way (20-shot) | Standard Deviation | 3 | ReCon++ |
| Shape Representation Of 3D Point Clouds | ModelNet40 5-way (10-shot) | Overall Accuracy | 98 | ReCon++ |
| Shape Representation Of 3D Point Clouds | ModelNet40 5-way (10-shot) | Standard Deviation | 2.3 | ReCon++ |
| Shape Representation Of 3D Point Clouds | ModelNet40 10-way (10-shot) | Overall Accuracy | 94.5 | ReCon++ |
| Shape Representation Of 3D Point Clouds | ModelNet40 10-way (10-shot) | Standard Deviation | 4.1 | ReCon++ |
| Shape Representation Of 3D Point Clouds | ModelNet40 5-way (20-shot) | Overall Accuracy | 99.5 | ReCon++ |
| Shape Representation Of 3D Point Clouds | ModelNet40 5-way (20-shot) | Standard Deviation | 0.8 | ReCon++ |
| Shape Representation Of 3D Point Clouds | ScanObjectNN | OBJ_ONLY Accuracy(%) | 65.4 | ReCon++ |
| Shape Representation Of 3D Point Clouds | ModelNet40 | Accuracy (%) | 87.3 | ReCon++ |
| 3D Object Classification | Objaverse | Objaverse (Average) | 54.5 | ShapeLLM-7B |
| 3D Object Classification | Objaverse | Objaverse (Average) | 54 | ShapeLLM-13B |
| 3D Object Classification | ModelNet40 | ModelNet40 (Average) | 53.08 | ShapeLLM-7B |
| 3D Object Classification | ModelNet40 | ModelNet40 (Average) | 52.96 | ShapeLLM-13B |
| 3D Point Cloud Classification | ScanObjectNN | OBJ-BG (OA) | 98.8 | ReCon++ |
| 3D Point Cloud Classification | ScanObjectNN | OBJ-ONLY (OA) | 97.59 | ReCon++ |
| 3D Point Cloud Classification | ScanObjectNN | Overall Accuracy | 95.25 | ReCon++ |
| 3D Point Cloud Classification | ModelNet40 | Overall Accuracy | 95 | ReCon++ |
| 3D Point Cloud Classification | Objaverse | Objaverse (Average) | 54.5 | ShapeLLM-7B |
| 3D Point Cloud Classification | Objaverse | Objaverse (Average) | 54 | ShapeLLM-13B |
| 3D Point Cloud Classification | ModelNet40 | ModelNet40 (Average) | 53.08 | ShapeLLM-7B |
| 3D Point Cloud Classification | ModelNet40 | ModelNet40 (Average) | 52.96 | ShapeLLM-13B |
| 3D Point Cloud Classification | ModelNet40 10-way (20-shot) | Overall Accuracy | 96.5 | ReCon++ |
| 3D Point Cloud Classification | ModelNet40 10-way (20-shot) | Standard Deviation | 3 | ReCon++ |
| 3D Point Cloud Classification | ModelNet40 5-way (10-shot) | Overall Accuracy | 98 | ReCon++ |
| 3D Point Cloud Classification | ModelNet40 5-way (10-shot) | Standard Deviation | 2.3 | ReCon++ |
| 3D Point Cloud Classification | ModelNet40 10-way (10-shot) | Overall Accuracy | 94.5 | ReCon++ |
| 3D Point Cloud Classification | ModelNet40 10-way (10-shot) | Standard Deviation | 4.1 | ReCon++ |
| 3D Point Cloud Classification | ModelNet40 5-way (20-shot) | Overall Accuracy | 99.5 | ReCon++ |
| 3D Point Cloud Classification | ModelNet40 5-way (20-shot) | Standard Deviation | 0.8 | ReCon++ |
| 3D Point Cloud Classification | ScanObjectNN | OBJ_ONLY Accuracy(%) | 65.4 | ReCon++ |
| 3D Point Cloud Classification | ModelNet40 | Accuracy (%) | 87.3 | ReCon++ |
| 3D Classification | Objaverse | Objaverse (Average) | 54.5 | ShapeLLM-7B |
| 3D Classification | Objaverse | Objaverse (Average) | 54 | ShapeLLM-13B |
| 3D Classification | ModelNet40 | ModelNet40 (Average) | 53.08 | ShapeLLM-7B |
| 3D Classification | ModelNet40 | ModelNet40 (Average) | 52.96 | ShapeLLM-13B |
| 3D Point Cloud Linear Classification | ModelNet40 | Overall Accuracy | 93.6 | ReCon++ |
| 3D Point Cloud Reconstruction | ScanObjectNN | OBJ-BG (OA) | 98.8 | ReCon++ |
| 3D Point Cloud Reconstruction | ScanObjectNN | OBJ-ONLY (OA) | 97.59 | ReCon++ |
| 3D Point Cloud Reconstruction | ScanObjectNN | Overall Accuracy | 95.25 | ReCon++ |
| 3D Point Cloud Reconstruction | ModelNet40 | Overall Accuracy | 95 | ReCon++ |
| 3D Point Cloud Reconstruction | Objaverse | Objaverse (Average) | 54.5 | ShapeLLM-7B |
| 3D Point Cloud Reconstruction | Objaverse | Objaverse (Average) | 54 | ShapeLLM-13B |
| 3D Point Cloud Reconstruction | ModelNet40 | ModelNet40 (Average) | 53.08 | ShapeLLM-7B |
| 3D Point Cloud Reconstruction | ModelNet40 | ModelNet40 (Average) | 52.96 | ShapeLLM-13B |
| 3D Point Cloud Reconstruction | ModelNet40 10-way (20-shot) | Overall Accuracy | 96.5 | ReCon++ |
| 3D Point Cloud Reconstruction | ModelNet40 10-way (20-shot) | Standard Deviation | 3 | ReCon++ |
| 3D Point Cloud Reconstruction | ModelNet40 5-way (10-shot) | Overall Accuracy | 98 | ReCon++ |
| 3D Point Cloud Reconstruction | ModelNet40 5-way (10-shot) | Standard Deviation | 2.3 | ReCon++ |
| 3D Point Cloud Reconstruction | ModelNet40 10-way (10-shot) | Overall Accuracy | 94.5 | ReCon++ |
| 3D Point Cloud Reconstruction | ModelNet40 10-way (10-shot) | Standard Deviation | 4.1 | ReCon++ |
| 3D Point Cloud Reconstruction | ModelNet40 5-way (20-shot) | Overall Accuracy | 99.5 | ReCon++ |
| 3D Point Cloud Reconstruction | ModelNet40 5-way (20-shot) | Standard Deviation | 0.8 | ReCon++ |
| 3D Point Cloud Reconstruction | ScanObjectNN | OBJ_ONLY Accuracy(%) | 65.4 | ReCon++ |
| 3D Point Cloud Reconstruction | ModelNet40 | Accuracy (%) | 87.3 | ReCon++ |
| Generative 3D Object Classification | Objaverse | Objaverse (Average) | 54.5 | ShapeLLM-7B |
| Generative 3D Object Classification | Objaverse | Objaverse (Average) | 54 | ShapeLLM-13B |
| Generative 3D Object Classification | ModelNet40 | ModelNet40 (Average) | 53.08 | ShapeLLM-7B |
| Generative 3D Object Classification | ModelNet40 | ModelNet40 (Average) | 52.96 | ShapeLLM-13B |
| 3D Object Captioning | Objaverse | Sentence-BERT | 48.52 | ShapeLLM-13B |
| 3D Object Captioning | Objaverse | GPT-4 | 48.94 | ShapeLLM-13B |
| 3D Object Captioning | Objaverse | SimCSE | 49.98 | ShapeLLM-13B |
| 3D Object Captioning | Objaverse | Sentence-BERT | 48.2 | ShapeLLM-7B |
| 3D Object Captioning | Objaverse | GPT-4 | 46.92 | ShapeLLM-7B |
| 3D Object Captioning | Objaverse | SimCSE | 49.23 | ShapeLLM-7B |
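
The few-shot rows in the table pair an Overall Accuracy with a Standard Deviation. A minimal sketch of how such a pair could be derived follows, assuming the two numbers summarize accuracy over several independent few-shot runs; the table itself does not state the protocol. The helper name `summarize_runs`, the per-run values, and the choice of the sample (n-1) standard deviation are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies in percent."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    # stdev() uses the n-1 (sample) estimator; this choice is an assumption.
    return mean(accuracies), stdev(accuracies)


# Hypothetical per-run accuracies (percent), for illustration only; not from the paper.
runs = [96.0, 98.0, 97.0, 99.0, 95.0]
avg, sd = summarize_runs(runs)
print(f"{avg:.1f} +/- {sd:.1f}")
```

If the population estimator were intended instead, `statistics.pstdev` would be the drop-in replacement.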