Instruction-driven history-aware policies for robotic manipulations

Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid

2022-09-11
Robot Manipulation, Robot Manipulation Generalization

Abstract

In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions. Yet, robotic manipulation is extremely challenging as it requires fine-grained motor control, long-term memory as well as generalization to previously unseen tasks and environments. To address these challenges, we propose a unified transformer-based approach that takes into account multiple inputs. In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations while (iii) keeping track of the full history of observations and actions. Such an approach enables learning dependencies between history and instructions and improves manipulation precision using multiple views. We evaluate our method on the challenging RLBench benchmark and on a real-world robot. Notably, our approach scales to 74 diverse RLBench tasks and outperforms the state of the art. We also address instruction-conditioned tasks and demonstrate excellent generalization to previously unseen variations.
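The abstract's idea of attending jointly over the instruction, multiple camera views, and the full history can be illustrated with a small token-layout sketch. This is not the authors' implementation: the function names, the sequence layout `[instruction | step 0 | step 1 | ...]`, and the masking rule (each visual token sees the instruction plus its own and earlier steps) are all assumptions made for illustration.

```python
import numpy as np


def build_token_layout(num_instr_tokens, num_steps, num_views, patches_per_view):
    """Return a step index per token for an [instruction | step 0 | step 1 | ...] sequence.

    Hypothetical layout: instruction tokens get step index -1, and every
    visual token (one per patch, per view) gets the index of its time step.
    """
    tokens_per_step = num_views * patches_per_view
    instr = np.full(num_instr_tokens, -1, dtype=int)
    visual = np.repeat(np.arange(num_steps), tokens_per_step)
    return np.concatenate([instr, visual])


def history_attention_mask(step_ids):
    """Boolean mask where mask[q, k] is True if query token q may attend to key token k.

    Assumed rule: a token attends to keys whose step index is not later than
    its own. Visual tokens therefore see the instruction and all views of the
    current and past steps; instruction tokens see only the instruction.
    """
    query_steps = step_ids[:, None]
    key_steps = step_ids[None, :]
    return key_steps <= query_steps


# Example: 3 instruction tokens, 2 time steps, 2 camera views, 4 patches per view.
step_ids = build_token_layout(3, 2, 2, 4)
mask = history_attention_mask(step_ids)
```

A mask of this shape could be passed to a standard attention layer so that the prediction at each step is conditioned on the instruction and the whole observation history up to that step, while never looking at future steps.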

Results

Task	Dataset	Metric	Value	Model
Robot Manipulation	RLBench	Succ. Rate (10 tasks, 100 demos/task)	83.3	Hiveformer
Robot Manipulation	RLBench	Succ. Rate (18 tasks, 100 demos/task)	45.3	Hiveformer
Robot Manipulation	GEMBench	Average Success Rate	30.4	Hiveformer
