RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics

Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox

2024-06-15 · Spatial Reasoning · Synthetic Data Generation · Robot Navigation · Language Modelling

Abstract

From rearranging objects on a table to putting groceries onto shelves, robots must plan precise action points to perform tasks accurately and reliably. Despite the recent adoption of vision-language models (VLMs) to control robot behavior, VLMs struggle to precisely articulate robot actions through language. We introduce an automatic synthetic data generation pipeline that instruction-tunes VLMs for robotic domains and needs. Using this pipeline, we train RoboPoint, a VLM that predicts image keypoint affordances given language instructions. Compared to alternative approaches, our method requires no real-world data collection or human demonstration, making it much more scalable to diverse environments and viewpoints. In addition, RoboPoint is a general model that enables several downstream applications such as robot navigation, manipulation, and augmented reality (AR) assistance. Our experiments demonstrate that RoboPoint outperforms state-of-the-art VLMs (GPT-4o) and visual prompting techniques (PIVOT) by 21.8% in spatial affordance prediction accuracy and by 30.5% in downstream task success rate. Project website: https://robo-point.github.io.
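To make the prediction interface concrete, below is a minimal sketch of how a RoboPoint-style model could be queried: given an image and a language instruction, the model returns candidate action points as 2D image coordinates. The `query_vlm` callable and the exact prompt wording are hypothetical stand-ins for illustration; the list-of-tuples output format follows the convention described on the project website, but none of this is the paper's verbatim API.

```python
# Minimal sketch of a RoboPoint-style affordance query (hypothetical interface).
# `query_vlm` is a placeholder for whatever inference call the deployed model
# exposes; it is assumed to return its answer as a text list of normalized
# (x, y) tuples, e.g. "[(0.32, 0.71), (0.45, 0.68)]".
import ast

def predict_affordance_points(image, instruction, query_vlm):
    """Return pixel-space keypoints where `instruction` can be executed.

    image: HxWxC numpy array; instruction: natural-language command.
    """
    prompt = (
        f"{instruction} Your answer should be formatted as a list of tuples, "
        "i.e. [(x1, y1), (x2, y2), ...], where each tuple contains the "
        "x and y coordinates of a point."
    )
    raw = query_vlm(image=image, prompt=prompt)   # e.g. "[(0.32, 0.71)]"
    points = ast.literal_eval(raw)                # parse the textual answer
    h, w = image.shape[:2]
    # Scale normalized coordinates back to pixel space for the controller.
    return [(int(x * w), int(y * h)) for x, y in points]
```

Text-formatted points keep such an interface model-agnostic: the same parsing would work whether the backbone is RoboPoint, GPT-4o, or a PIVOT-style visual prompting loop, which is what makes the head-to-head comparisons in the abstract possible.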

Results

Task: Visual Question Answering (VQA)
Dataset: 6-DoF SpatialBench
Model: RoboPoint

Metric            Value
Orientation-abs    25.8
Orientation-rel    33.8
Position-abs       30.8
Position-rel       43.8
Total              33.5
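The Total row appears to be the unweighted mean of the four sub-metrics: (25.8 + 33.8 + 30.8 + 43.8) / 4 = 33.55, which matches the reported 33.5 up to rounding. A quick check, assuming that aggregation rule:

```python
# Sanity check: is Total the unweighted mean of the sub-metrics?
# (Assumption inferred from the numbers; the benchmark may aggregate differently.)
scores = {
    "Orientation-abs": 25.8,
    "Orientation-rel": 33.8,
    "Position-abs": 30.8,
    "Position-rel": 43.8,
}
total = sum(scores.values()) / len(scores)
print(f"{total:.2f}")  # 33.55, consistent with the reported Total of 33.5
```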

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning (2025-07-16)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)