Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

SpatialBot: Precise Spatial Understanding with Vision Language Models

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, Bo Zhao

2024-06-19 | Spatial Reasoning

Abstract

Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding; however, they still struggle with spatial understanding, which is the foundation of Embodied AI. In this paper, we propose SpatialBot for better spatial understanding by feeding both RGB and depth images. Additionally, we have constructed the SpatialQA dataset, which involves multi-level depth-related questions to train VLMs for depth understanding. Finally, we present SpatialBench to comprehensively evaluate VLMs' capabilities in spatial understanding at different levels. Extensive experiments on our spatial-understanding benchmark, general VLM benchmarks, and Embodied AI tasks demonstrate the remarkable improvements of SpatialBot trained on SpatialQA. The model, code, and data are available at https://github.com/BAAI-DCAI/SpatialBot.

Results

Task                             Dataset             Metric           Value  Model
Visual Question Answering (VQA)  6-DoF SpatialBench  Orientation-abs  22.9   SpatialBot
Visual Question Answering (VQA)  6-DoF SpatialBench  Orientation-rel  39.6   SpatialBot
Visual Question Answering (VQA)  6-DoF SpatialBench  Position-abs     21.6   SpatialBot
Visual Question Answering (VQA)  6-DoF SpatialBench  Position-rel     50.9   SpatialBot
Visual Question Answering (VQA)  6-DoF SpatialBench  Total            32.7   SpatialBot

Related Papers

MindJourney: Test-Time Scaling with World Models for Spatial Reasoning (2025-07-16)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Warehouse Spatial Question Answering with LLM Agent (2025-07-14)
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way (2025-07-11)
M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning (2025-07-11)
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding (2025-07-10)
Scaling RL to Long Videos (2025-07-10)
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding (2025-07-09)