
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, Jianwei Yang

2024-12-13 · Robot Manipulation · Vision-Language-Action

Abstract

Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective at handling complex tasks such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to improve VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks, and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment dataset and finetuned on our dataset, that rivals the 7B OpenVLA baseline while significantly improving inference efficiency.
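The abstract describes the core mechanism, overlaying tracked point trajectories on the observation image so the policy can see recent motion, only at a high level. Below is a minimal Python sketch of the overlay step, assuming traces are already available as per-step pixel coordinates (e.g., from an off-the-shelf point tracker). The function name, signature, and color-fading scheme are hypothetical, not the paper's code, and TraceVLA's actual pipeline includes details (point selection, prompt formatting, feeding both raw and annotated frames to the VLA) that this sketch omits.

```python
import numpy as np
import cv2

def overlay_visual_trace(frame, traces,
                         color_start=(255, 0, 0), color_end=(0, 0, 255)):
    """Draw tracked point trajectories onto an observation frame.

    frame:  H x W x 3 uint8 image (the current camera observation).
    traces: (num_points, num_steps, 2) array of (x, y) pixel coordinates
            giving each tracked point's position over the recent history.
    Returns a copy of the frame with one polyline per tracked point,
    colored from color_start (oldest) to color_end (newest) so the
    direction of motion is readable from a single image.
    """
    annotated = frame.copy()
    num_steps = traces.shape[1]
    for point_track in traces:
        for t in range(num_steps - 1):
            # Linearly interpolate the color along the trajectory.
            alpha = t / max(num_steps - 2, 1)
            color = tuple(int((1 - alpha) * s + alpha * e)
                          for s, e in zip(color_start, color_end))
            p0 = tuple(int(v) for v in np.round(point_track[t]))
            p1 = tuple(int(v) for v in np.round(point_track[t + 1]))
            cv2.line(annotated, p0, p1, color, 2)
    return annotated

# Toy usage: four random-walk traces over a blank 256x256 frame. A real
# pipeline would obtain traces by tracking points across past frames and
# would pass the annotated image to the policy as part of its prompt.
frame = np.zeros((256, 256, 3), dtype=np.uint8)
traces = 128 + np.cumsum(np.random.randint(-3, 4, size=(4, 16, 2)), axis=1)
prompted = overlay_visual_trace(frame, traces)
```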

Results

Task               | Dataset                 | Metric                                  | Value | Model
-------------------|-------------------------|-----------------------------------------|-------|---------
Robot Manipulation | SimplerEnv-Google Robot | Variant Aggregation                     | 0.45  | TraceVLA
Robot Manipulation | SimplerEnv-Google Robot | Variant Aggregation - Move Near         | 0.564 | TraceVLA
Robot Manipulation | SimplerEnv-Google Robot | Variant Aggregation - Open/Close Drawer | 0.31  | TraceVLA
Robot Manipulation | SimplerEnv-Google Robot | Variant Aggregation - Pick Coke Can     | 0.60  | TraceVLA
Robot Manipulation | SimplerEnv-Google Robot | Visual Matching                         | 0.46  | TraceVLA
Robot Manipulation | SimplerEnv-Google Robot | Visual Matching - Move Near             | 0.60  | TraceVLA
Robot Manipulation | SimplerEnv-Google Robot | Visual Matching - Open/Close Drawer     | 0.24  | TraceVLA
Robot Manipulation | SimplerEnv-Google Robot | Visual Matching - Pick Coke Can         | 0.56  | TraceVLA

Related Papers

LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation (2025-07-17)
Vision Language Action Models in Robotic Manipulation: A Systematic Review (2025-07-14)
VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting (2025-07-07)
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge (2025-07-06)
Geometry-aware 4D Video Generation for Robot Manipulation (2025-07-01)
A Survey on Vision-Language-Action Models for Autonomous Driving (2025-06-30)
WorldVLA: Towards Autoregressive Action World Model (2025-06-26)