Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, Xihui Liu

2024-09-26 · Scene Understanding · 3D Question Answering (3D-QA)
Paper · PDF

Abstract

Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we utilize 3D position embeddings to bring the 2D CLIP patches into a 3D spatial context. By integrating the 3D position embeddings into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains 2D image understanding and vision-language conversation capabilities comparable to LLaVA.
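The core idea in the abstract is to lift 2D CLIP patch features into a 3D spatial context by adding 3D position embeddings before they reach the LMM. The paper's exact embedding formulation is not reproduced on this page; the following is a minimal sketch under the assumption of per-axis sinusoidal embeddings (the function names `sinusoidal_embedding` and `add_3d_position_embeddings` are illustrative, not from the LLaVA-3D codebase):

```python
import numpy as np

def sinusoidal_embedding(coords, dim):
    # Embed scalar coordinates into `dim` channels (dim must be even)
    # using sin/cos at geometrically spaced frequencies.
    assert dim % 2 == 0
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = coords[..., None] * freqs            # (..., dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def add_3d_position_embeddings(patch_feats, patch_xyz):
    # patch_feats: (N, D) 2D CLIP patch features, D divisible by 6 here.
    # patch_xyz:   (N, 3) back-projected 3D coordinate of each patch.
    # Returns the features shifted by a per-axis positional embedding,
    # placing each 2D patch in a 3D spatial context.
    n, d = patch_feats.shape
    per_axis = d // 3                             # split channels across x, y, z
    pe = np.concatenate(
        [sinusoidal_embedding(patch_xyz[:, i], per_axis) for i in range(3)],
        axis=-1,
    )
    return patch_feats + pe

# Toy usage: 4 patches with 24-dim features and random 3D coordinates.
feats = np.zeros((4, 24))
xyz = np.random.rand(4, 3)
out = add_3d_position_embeddings(feats, xyz)
print(out.shape)  # (4, 24)
```

Because the embedding is simply added, the same vision backbone and LLM weights can serve both 2D inputs (embedding omitted) and 3D scene inputs, which is consistent with the unified-architecture claim in the abstract.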

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Visual Question Answering (VQA) | SQA3D | Exact Match | 60.1 | LLaVA-3D |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | BLEU-4 | 16.4 | LLaVA-3D |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | CIDEr | 103.1 | LLaVA-3D |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | Exact Match | 30.6 | LLaVA-3D |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | METEOR | 20.8 | LLaVA-3D |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | ROUGE | 49.6 | LLaVA-3D |

Related Papers

Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation (2025-07-15)
Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander (2025-07-15)
Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding (2025-07-10)