Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, Wenhan Xiong

2024-03-18 · Question Answering · 3D Question Answering (3D-QA) · Dense Captioning · Language Modelling

Abstract

This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently map these features into the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.
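The projection-layer idea described in the abstract can be sketched in plain Python. All concrete details below are illustrative assumptions, not the paper's architecture: the dimensions (`D_VIS`, `D_TXT`), random initialization, and the use of a single linear map are placeholders; the actual Scene-LLM projection and feature extraction are defined in the paper.

```python
import random

random.seed(0)

# Hypothetical sizes (not from the paper): 3D feature dim and LLM embedding dim.
D_VIS, D_TXT = 8, 16
N_SCENE, N_TEXT = 5, 3  # number of 3D scene tokens / text prompt tokens

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hybrid 3D visual features for one scene (one row per scene token).
scene_feats = rand_matrix(N_SCENE, D_VIS)

# "Learned" projection layer, here a single linear map W: R^D_VIS -> R^D_TXT.
W = rand_matrix(D_VIS, D_TXT)

def project(feats):
    """Map 3D visual features into the LLM's textual embedding space."""
    return [[sum(f[i] * W[i][j] for i in range(D_VIS)) for j in range(D_TXT)]
            for f in feats]

# Text token embeddings from the (frozen) language model -- stand-ins here.
text_embeds = rand_matrix(N_TEXT, D_TXT)

# Projected scene tokens are prepended to the text tokens, giving one
# sequence of same-width embeddings the LLM can attend over jointly.
llm_input = project(scene_feats) + text_embeds
print(len(llm_input), len(llm_input[0]))  # 8 16
```

The point of the sketch is only that, once projected, 3D scene features and text tokens share one embedding width and can be concatenated into a single input sequence for the language model.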

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | SQA3D | Exact Match | 54.2 | Scene-LLM |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | BLEU-4 | 12 | Scene-LLM |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | CIDEr | 80 | Scene-LLM |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | Exact Match | 27.2 | Scene-LLM |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | METEOR | 16.6 | Scene-LLM |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | ROUGE | 40 | Scene-LLM |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)