Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Duo Zheng, Shijia Huang, LiWei Wang

2024-11-30 | CVPR 2025 | Scene Understanding | 3D Question Answering (3D-QA)

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly impacted various multimodal tasks. However, these models still struggle with tasks that require spatial understanding within 3D environments. Efforts to enhance MLLMs, such as incorporating point cloud features, have been made, yet a considerable gap remains between the models' learned representations and the inherent complexity of 3D scenes. This gap largely stems from training MLLMs on predominantly 2D data, which limits their ability to comprehend 3D spaces. To address this issue, we propose Video-3D LLM, a novel generalist model for 3D scene understanding. By treating 3D scenes as dynamic videos and incorporating 3D position encoding into these representations, Video-3D LLM aligns video representations with real-world spatial contexts more accurately. In addition, we implement a maximum coverage sampling technique to optimize the trade-off between computational cost and performance. Extensive experiments demonstrate that our model achieves state-of-the-art performance on several 3D scene understanding benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
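The abstract does not spell out how maximum coverage sampling is computed, so the following is only a minimal sketch of the standard greedy approximation to the maximum coverage problem, applied to frame selection. It assumes, hypothetically, that each video frame has already been mapped to the set of scene voxel IDs it observes; the function name `greedy_max_coverage`, the `frame_voxels` mapping, and the `budget` parameter are illustrative and not taken from the paper.

```python
from typing import Dict, List, Set


def greedy_max_coverage(frame_voxels: Dict[int, Set[int]], budget: int) -> List[int]:
    """Pick up to `budget` frames, each time taking the frame that adds
    the most voxels not yet covered by earlier picks (assumed setup)."""
    covered: Set[int] = set()
    selected: List[int] = []
    remaining = dict(frame_voxels)  # shallow copy so the input is not mutated
    while remaining and len(selected) < budget:
        # Frame with the largest marginal gain; ties go to the lowest frame id.
        best = max(sorted(remaining), key=lambda f: len(remaining[f] - covered))
        if not remaining[best] - covered:
            break  # no remaining frame adds new coverage
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


# Toy example: four frames observing overlapping sets of voxel ids.
frames = {
    0: {1, 2, 3},
    1: {3, 4},
    2: {4, 5, 6, 7},
    3: {1, 2},
}
picked = greedy_max_coverage(frames, budget=2)
print(picked)
```

In this toy input, frame 2 is chosen first because it sees four voxels, and frame 0 second because it contributes three more. The early exit illustrates the cost side of the trade-off: once every voxel is covered, extra frames add compute without adding coverage.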

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	SQA3D	Exact Match	58.6	Video-3D LLM
Visual Question Answering (VQA)	ScanQA Test w/ objects	CIDEr	102.1	Video-3D LLM
Visual Question Answering (VQA)	ScanQA Test w/ objects	Exact Match	30.1	Video-3D LLM

Related Papers

Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation (2025-07-15)
Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander (2025-07-15)
Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding (2025-07-10)