Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia

2024-01-22 · CVPR 2024
Tasks: Spatial Reasoning · Question Answering · Visual Question Answering (VQA)
Paper · PDF

Abstract

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships of physical objects like distances or size differences. We hypothesize that VLMs' limited spatial reasoning capability is due to the lack of 3D spatial knowledge in training data and aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images. We then investigate various factors in the training recipe, including data quality, training pipeline, and VLM architecture. Our work features the first internet-scale 3D spatial reasoning dataset in metric space. By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA. Finally, we demonstrate that this VLM unlocks novel downstream applications in chain-of-thought spatial reasoning and robotics due to its quantitative estimation capability. Project website: https://spatial-vlm.github.io/
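To make the data-generation idea concrete, here is a minimal, hypothetical sketch of how quantitative spatial QA pairs could be synthesized from per-object 3D centroids (e.g. lifted from monocular depth estimates). The function name, object format, and question templates are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch: synthesize metric spatial-QA pairs from 3D object
# centroids. This is NOT the paper's implementation, only an illustration
# of generating quantitative (distance) and qualitative (closer/farther)
# questions in metric space, as the abstract describes.
import math


def spatial_qa(objects):
    """objects: dict mapping object name -> (x, y, z) centroid in metres,
    with z the distance along the camera axis. Returns (question, answer) pairs."""
    qa_pairs = []
    names = list(objects)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Quantitative question: Euclidean distance between centroids.
            dist = math.dist(objects[a], objects[b])
            qa_pairs.append((f"How far is the {a} from the {b}?",
                             f"About {dist:.1f} metres."))
            # Qualitative question: only emit when depth order is unambiguous.
            az, bz = objects[a][2], objects[b][2]
            if abs(az - bz) > 0.05:
                closer = a if az < bz else b
                qa_pairs.append(
                    (f"Which is closer to the camera, the {a} or the {b}?",
                     f"The {closer}."))
    return qa_pairs


pairs = spatial_qa({"mug": (0.2, 0.0, 0.8), "laptop": (-0.1, 0.1, 1.4)})
```

Run over millions of images with automatically detected objects and estimated depth, a template generator of this shape is one plausible way to reach the billions-of-examples scale the paper reports.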

Results

Task                            | Dataset            | Metric          | Value | Model
--------------------------------|--------------------|-----------------|-------|------------
Visual Question Answering (VQA) | 6-DoF SpatialBench | Orientation-abs | 25    | SpaceMantis
Visual Question Answering (VQA) | 6-DoF SpatialBench | Orientation-rel | 27.2  | SpaceMantis
Visual Question Answering (VQA) | 6-DoF SpatialBench | Position-abs    | 29.2  | SpaceMantis
Visual Question Answering (VQA) | 6-DoF SpatialBench | Position-rel    | 33.6  | SpaceMantis
Visual Question Answering (VQA) | 6-DoF SpatialBench | Total           | 28.9  | SpaceMantis
Visual Question Answering (VQA) | 6-DoF SpatialBench | Orientation-abs | 24.9  | SpaceLLaVA
Visual Question Answering (VQA) | 6-DoF SpatialBench | Orientation-rel | 30.9  | SpaceLLaVA
Visual Question Answering (VQA) | 6-DoF SpatialBench | Position-abs    | 30.5  | SpaceLLaVA
Visual Question Answering (VQA) | 6-DoF SpatialBench | Position-rel    | 32.4  | SpaceLLaVA
Visual Question Answering (VQA) | 6-DoF SpatialBench | Total           | 28.2  | SpaceLLaVA

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)