Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, JiaWei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi

2025-02-18 · Spatial Reasoning · Object Rearrangement · Robot Navigation · Robot Manipulation · Visual Question Answering

Paper · PDF · Code

Abstract

Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions under both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
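The core idea, a reference-frame-free mapping from natural-language phrases to object directions, can be sketched as follows. This is a minimal illustration only: the class, function names, and example vectors are hypothetical, and the actual SoFar system predicts directions with a learned model trained on OrienText300K rather than a hand-written lookup table.

```python
from dataclasses import dataclass
import math

@dataclass
class SemanticOrientation:
    """A direction on an object named in natural language.
    Reference-frame-free: the vector lives in the object's own
    (arbitrary) model frame, not a canonical world frame."""
    phrase: str                             # e.g. "plug-in direction"
    direction: tuple                        # unit vector in the object frame

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

# Hypothetical annotations in the spirit of OrienText300K:
usb_orientations = [
    SemanticOrientation("plug-in direction", normalize((0.0, 0.0, 1.0))),
]
knife_orientations = [
    SemanticOrientation("handle direction", normalize((1.0, 0.0, 0.0))),
]

def lookup(orientations, phrase):
    """Resolve a language phrase to a direction vector.
    Exact string match here for illustration; the paper's system
    generalizes to open-vocabulary phrases via a learned model."""
    for o in orientations:
        if o.phrase == phrase:
            return o.direction
    return None

print(lookup(usb_orientations, "plug-in direction"))  # (0.0, 0.0, 1.0)
```

A downstream planner could then attach such a direction as an orientational constraint (e.g., "plug-in direction must point toward the port") alongside the usual positional constraints on the grasp or placement pose.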

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | EmbSpatial-Bench | Generation | 70.88 | SoFar
Visual Question Answering (VQA) | 6-DoF SpatialBench | Orientation-abs | 31.3 | SoFar
Visual Question Answering (VQA) | 6-DoF SpatialBench | Orientation-rel | 54.6 | SoFar
Visual Question Answering (VQA) | 6-DoF SpatialBench | Position-abs | 33.8 | SoFar
Visual Question Answering (VQA) | 6-DoF SpatialBench | Position-rel | 59.6 | SoFar
Visual Question Answering (VQA) | 6-DoF SpatialBench | Total | 43.9 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Variant Aggregation | 0.676 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Variant Aggregation-Move Near | 0.74 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Variant Aggregation-Open/Close Drawer | 0.297 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Variant Aggregation-Pick Coke Can | 0.907 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Visual Matching | 0.749 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Visual Matching-Move Near | 0.917 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Visual Matching-Open/Close Drawer | 0.403 | SoFar
Robot Manipulation | SimplerEnv-Google Robot | Visual Matching-Pick Coke Can | 0.923 | SoFar
Robot Manipulation | SimplerEnv-Widow X | Average | 0.583 | SoFar
Robot Manipulation | SimplerEnv-Widow X | Put Carrot on Plate | 0.667 | SoFar
Robot Manipulation | SimplerEnv-Widow X | Put Eggplant in Yellow Basket | 0.375 | SoFar
Robot Manipulation | SimplerEnv-Widow X | Put Spoon on Towel | 0.583 | SoFar
Robot Manipulation | SimplerEnv-Widow X | Stack Green Block on Yellow Block | 0.708 | SoFar
Object Rearrangement | Open6DOR V2 | 6-DoF | 48.7 | SoFar
Object Rearrangement | Open6DOR V2 | pos-level0 | 96 | SoFar
Object Rearrangement | Open6DOR V2 | pos-level1 | 81.5 | SoFar
Object Rearrangement | Open6DOR V2 | rot-level0 | 68.6 | SoFar
Object Rearrangement | Open6DOR V2 | rot-level1 | 42.2 | SoFar
Object Rearrangement | Open6DOR V2 | rot-level2 | 70.1 | SoFar

Related Papers

MindJourney: Test-Time Scaling with World Models for Spatial Reasoning (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Warehouse Spatial Question Answering with LLM Agent (2025-07-14)
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way (2025-07-11)
M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning (2025-07-11)
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding (2025-07-10)
Scaling RL to Long Videos (2025-07-10)