Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, Wenhan Xiong
This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently map these features into the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | SQA3D | Exact Match | 54.2 | Scene-LLM |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | BLEU-4 | 12 | Scene-LLM |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | CIDEr | 80 | Scene-LLM |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | Exact Match | 27.2 | Scene-LLM |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | METEOR | 16.6 | Scene-LLM |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | ROUGE | 40 | Scene-LLM |
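The projection layer described in the abstract maps 3D visual features into the LLM's token embedding space so they can be consumed alongside word embeddings. The sketch below is a minimal illustration of this general pattern, not the authors' implementation; the class name and all dimensions (`visual_dim`, `llm_embed_dim`, token counts) are hypothetical.

```python
import torch
import torch.nn as nn


class SceneFeatureProjector(nn.Module):
    """Hypothetical sketch: project per-token 3D visual features into a
    pre-trained LLM's textual embedding space, so the projected features
    can be prepended to the text token embeddings as "soft" scene tokens."""

    def __init__(self, visual_dim: int = 1024, llm_embed_dim: int = 4096):
        super().__init__()
        # A single learned linear map; real systems may use an MLP instead.
        self.proj = nn.Linear(visual_dim, llm_embed_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_scene_tokens, visual_dim)
        # returns: (batch, num_scene_tokens, llm_embed_dim)
        return self.proj(feats)


# Usage sketch: 8 scene tokens from a hybrid 3D representation.
projector = SceneFeatureProjector()
scene_tokens = projector(torch.randn(1, 8, 1024))
print(tuple(scene_tokens.shape))  # (1, 8, 4096)
```

Keeping the LLM frozen and training only such a projection (plus optional fine-tuning stages) is a common way to align a new modality with a pre-trained textual embedding space.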