Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng
We introduce Point-Bind, a 3D multi-modality model that aligns point clouds with 2D images, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and the other modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) that follows 3D multi-modal instructions. Using parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA; it requires no 3D instruction data, yet exhibits superior 3D and multi-modal question-answering capacity. We hope our work sheds light on extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
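
To make the idea of 3D embedding arithmetic in a joint embedding space concrete, the following is a minimal sketch, not the authors' implementation. It assumes each modality has already been encoded into a vector in one shared space; the helper names (`normalize`, `compose`, `retrieve`) and the toy 3-dimensional vectors are hypothetical and chosen only for illustration. The sketch sums a point-cloud embedding with an audio embedding and retrieves the most similar gallery item by cosine similarity.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two unit-norm vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

def compose(*embeddings):
    """Embedding arithmetic: sum unit-norm embeddings, then renormalize."""
    summed = [sum(dims) for dims in zip(*map(normalize, embeddings))]
    return normalize(summed)

def retrieve(query, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda k: cosine(query, normalize(gallery[k])))

# Toy embeddings (hypothetical values, not outputs of any real encoder).
point_cloud_car = [1.0, 0.0, 0.0]   # stands in for a 3D encoder output
audio_waves = [0.0, 1.0, 0.0]       # stands in for an audio encoder output
gallery = {
    "car_on_road": [1.0, 0.1, 0.0],
    "car_by_sea": [0.7, 0.7, 0.0],
    "empty_beach": [0.0, 1.0, 0.2],
}

query = compose(point_cloud_car, audio_waves)
best = retrieve(query, gallery)
print(best)
```

In this toy setup the composed query lies between the two inputs, so the gallery item that combines both concepts scores highest. With real encoders, the same arithmetic only works to the extent that the modalities are actually aligned in the shared space.
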
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | 3D MM-Vet | Overall Accuracy | 23.5 | Point-Bind & Point-LLM |
| 3D | Objaverse | Objaverse (Average) | 5.25 | Point-Bind LLM |
| 3D | Objaverse | Objaverse (C) | 4.5 | Point-Bind LLM |
| 3D | Objaverse | Objaverse (I) | 6 | Point-Bind LLM |
| 3D | ModelNet40 | ModelNet40 (Average) | 45.81 | Point-Bind LLM |
| Shape Representation Of 3D Point Clouds | Objaverse | Objaverse (Average) | 5.25 | Point-Bind LLM |
| Shape Representation Of 3D Point Clouds | Objaverse | Objaverse (C) | 4.5 | Point-Bind LLM |
| Shape Representation Of 3D Point Clouds | Objaverse | Objaverse (I) | 6 | Point-Bind LLM |
| Shape Representation Of 3D Point Clouds | ModelNet40 | ModelNet40 (Average) | 45.81 | Point-Bind LLM |
| 3D Object Classification | Objaverse | Objaverse (Average) | 5.25 | Point-Bind LLM |
| 3D Object Classification | Objaverse | Objaverse (C) | 4.5 | Point-Bind LLM |
| 3D Object Classification | Objaverse | Objaverse (I) | 6 | Point-Bind LLM |
| 3D Object Classification | ModelNet40 | ModelNet40 (Average) | 45.81 | Point-Bind LLM |
| 3D Point Cloud Classification | Objaverse | Objaverse (Average) | 5.25 | Point-Bind LLM |
| 3D Point Cloud Classification | Objaverse | Objaverse (C) | 4.5 | Point-Bind LLM |
| 3D Point Cloud Classification | Objaverse | Objaverse (I) | 6 | Point-Bind LLM |
| 3D Point Cloud Classification | ModelNet40 | ModelNet40 (Average) | 45.81 | Point-Bind LLM |
| 3D Classification | Objaverse | Objaverse (Average) | 5.25 | Point-Bind LLM |
| 3D Classification | Objaverse | Objaverse (C) | 4.5 | Point-Bind LLM |
| 3D Classification | Objaverse | Objaverse (I) | 6 | Point-Bind LLM |
| 3D Classification | ModelNet40 | ModelNet40 (Average) | 45.81 | Point-Bind LLM |
| 3D Point Cloud Reconstruction | Objaverse | Objaverse (Average) | 5.25 | Point-Bind LLM |
| 3D Point Cloud Reconstruction | Objaverse | Objaverse (C) | 4.5 | Point-Bind LLM |
| 3D Point Cloud Reconstruction | Objaverse | Objaverse (I) | 6 | Point-Bind LLM |
| 3D Point Cloud Reconstruction | ModelNet40 | ModelNet40 (Average) | 45.81 | Point-Bind LLM |
| Generative 3D Object Classification | Objaverse | Objaverse (Average) | 5.25 | Point-Bind LLM |
| Generative 3D Object Classification | Objaverse | Objaverse (C) | 4.5 | Point-Bind LLM |
| Generative 3D Object Classification | Objaverse | Objaverse (I) | 6 | Point-Bind LLM |
| Generative 3D Object Classification | ModelNet40 | ModelNet40 (Average) | 45.81 | Point-Bind LLM |