Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, JiaWei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi

2025-02-18 · Spatial Reasoning · Object Rearrangement · Robot Navigation · Robot Manipulation · Visual Question Answering

Paper · PDF · Code

Abstract

Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions under both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
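The core idea, a reference-frame-free mapping from natural-language phrases to object directions, can be sketched as follows. This is a minimal illustration only: the class, function names, and example vectors are hypothetical, and the actual SoFar system predicts directions with a learned model trained on OrienText300K rather than a hand-written lookup table.

```python
from dataclasses import dataclass
import math

@dataclass
class SemanticOrientation:
    """A direction on an object named in natural language.
    Reference-frame-free: the vector lives in the object's own
    (arbitrary) model frame, not a canonical world frame."""
    phrase: str                             # e.g. "plug-in direction"
    direction: tuple                        # unit vector in the object frame

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

# Hypothetical annotations in the spirit of OrienText300K:
usb_orientations = [
    SemanticOrientation("plug-in direction", normalize((0.0, 0.0, 1.0))),
]
knife_orientations = [
    SemanticOrientation("handle direction", normalize((1.0, 0.0, 0.0))),
]

def lookup(orientations, phrase):
    """Resolve a language phrase to a direction vector.
    Exact string match here for illustration; the paper's system
    generalizes to open-vocabulary phrases via a learned model."""
    for o in orientations:
        if o.phrase == phrase:
            return o.direction
    return None

print(lookup(usb_orientations, "plug-in direction"))  # (0.0, 0.0, 1.0)
```

A downstream planner could then attach such a direction as an orientational constraint (e.g., "plug-in direction must point toward the port") alongside the usual positional constraints on the grasp or placement pose.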

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | EmbSpatial-Bench | Generation | 70.88 | SoFar
Visual Question Answering (VQA) | 6-DoF SpatialBench | Orientation-abs | 31.3 | SoFar
Visual Question Answering (VQA) | 6-DoF SpatialBench | Orientation-rel | 54.6 | SoFar
Visual Question Answering (VQA) | 6-DoF SpatialBench | Position-abs | 33.8 | SoFar
Visual Question Answering (VQA) | 6-DoF SpatialBench | Position-rel | 59.6 | SoFar
Visual Question Answering (VQA) | 6-DoF SpatialBench | Total | 43.9 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Variant Aggregation | 0.676 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Variant Aggregation-Move Near | 0.74 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Variant Aggregation-Open/Close Drawer | 0.297 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Variant Aggregation-Pick Coke Can | 0.907 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Visual Matching | 0.749 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Visual Matching-Move Near | 0.917 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Visual Matching-Open/Close Drawer | 0.403 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Visual Matching-Pick Coke Can | 0.923 | SoFar
Robot Manipulation | SimplerEnv-Widow X | Average | 0.583 | SoFar
Robot Manipulation | SimplerEnv-Widow X | Put Carrot on Plate | 0.667 | SoFar
Robot Manipulation | SimplerEnv-Widow X | Put Eggplant in Yellow Basket | 0.375 | SoFar
Robot Manipulation | SimplerEnv-Widow X | Put Spoon on Towel | 0.583 | SoFar
Robot Manipulation | SimplerEnv-Widow X | Stack Green Block on Yellow Block | 0.708 | SoFar
Object Rearrangement | Open6DOR V2 | 6-DoF | 48.7 | SoFar
Object Rearrangement | Open6DOR V2 | pos-level0 | 96 | SoFar
Object Rearrangement | Open6DOR V2 | pos-level1 | 81.5 | SoFar
Object Rearrangement | Open6DOR V2 | rot-level0 | 68.6 | SoFar
Object Rearrangement | Open6DOR V2 | rot-level1 | 42.2 | SoFar
Object Rearrangement | Open6DOR V2 | rot-level2 | 70.1 | SoFar

Related Papers

MindJourney: Test-Time Scaling with World Models for Spatial Reasoning (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Warehouse Spatial Question Answering with LLM Agent (2025-07-14)
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way (2025-07-11)
M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning (2025-07-11)
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding (2025-07-10)
Scaling RL to Long Videos (2025-07-10)