Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers

Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, Zhou Zhao

Published: 2023-12-13
Tasks: Question Answering · Attribute · Scene Understanding · 3D Question Answering (3D-QA)

Abstract

Recent advancements in 3D Large Language Models (LLMs) have demonstrated promising capabilities for 3D scene understanding. However, previous methods exhibit deficiencies in general referencing and grounding capabilities for intricate scene comprehension. In this paper, we introduce the use of object identifiers and object-centric representations to interact with scenes at the object level. Specifically, we decompose the input 3D scene into a set of object proposals, each assigned a unique identifier token, which enables efficient object referencing and grounding during user-assistant interactions. Given the scarcity of scene-language data, we model the scene embeddings as a sequence of explicit object-level embeddings, derived from semantic-rich 2D or 3D representations. By employing object identifiers, we transform diverse 3D scene-language tasks into a unified question-answering format, facilitating joint training without the need for additional task-specific heads. With minimal fine-tuning on all downstream tasks, our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
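The core idea in the abstract can be illustrated with a minimal sketch: each object proposal receives a unique identifier token, and those tokens are interleaved into a plain question-answering prompt so the model can reference objects by name. The token format (`<OBJ000>`) and function names below are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical sketch of the object-identifier scheme described in the
# abstract. A 3D scene is decomposed into object proposals; each proposal
# is assigned a unique identifier token so that referencing and grounding
# reduce to emitting the right token in a question-answering format.
# Token format and function names are illustrative assumptions.

def assign_object_identifiers(num_proposals: int) -> list[str]:
    """Return one unique identifier token per object proposal."""
    return [f"<OBJ{i:03d}>" for i in range(num_proposals)]

def build_prompt(identifiers: list[str], question: str) -> str:
    """Serialize identifier tokens (each standing in for an object-level
    embedding) plus the user question into a unified QA prompt."""
    scene = " ".join(identifiers)
    return f"Scene objects: {scene}\nQuestion: {question}\nAnswer:"

ids = assign_object_identifiers(3)
print(build_prompt(ids, "Where is the chair?"))
```

In the actual model, each identifier token would be paired with an object-centric embedding derived from 2D or 3D features; here the tokens alone show how grounding becomes a token-prediction problem.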

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Visual Question Answering (VQA) | SQA3D | Exact Match | 54.7 | Chat-3D v2 |
| Visual Question Answering (VQA) | SQA3D | Exact Match | 54.6 | ChatScene |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | BLEU-4 | 14.3 | ChatScene |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | CIDEr | 87.7 | ChatScene |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | Exact Match | 21.6 | ChatScene |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | METEOR | 18 | ChatScene |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | ROUGE | 41.6 | ChatScene |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | BLEU-4 | 14 | Chat-3D v2 |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | CIDEr | 87.6 | Chat-3D v2 |

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
- Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)