
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

Mohit Shridhar, Lucas Manuelli, Dieter Fox

2022-09-12 · Robot Manipulation · Robot Manipulation Generalization

Paper · PDF · Code (official)

Abstract

Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can manipulation still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action". Unlike frameworks that operate on 2D images, the voxelized 3D observation and action space provides a strong structural prior for efficiently learning 6-DoF actions. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines across a wide range of tabletop tasks.
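The "next best voxel action" formulation described above can be illustrated with a minimal sketch: voxelize an RGB-D point cloud into a 3D grid, then decode a discretized action by taking the argmax voxel (translation) and argmax rotation bins. This is an illustrative simplification, not PerAct's actual implementation; the function names, grid size, and 72-way rotation binning here are assumptions for the example.

```python
import numpy as np

def voxelize(points, bounds, grid_size=100):
    """Map 3D points (N, 3) into a grid_size^3 occupancy grid within bounds."""
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    idx = ((points - lo) / (hi - lo) * grid_size).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)
    grid = np.zeros((grid_size,) * 3, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0  # mark occupied voxels
    return grid

def decode_action(trans_logits, rot_logits, bounds, grid_size=100, rot_bins=72):
    """Decode a discretized 6-DoF action.

    trans_logits: (grid_size**3,) scores over voxels ("next best voxel").
    rot_logits:   (3, rot_bins) scores over discretized Euler angles.
    """
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    flat = int(np.argmax(trans_logits))
    ijk = np.unravel_index(flat, (grid_size,) * 3)
    # voxel centre back to continuous workspace coordinates
    xyz = lo + (np.asarray(ijk) + 0.5) / grid_size * (hi - lo)
    # each Euler angle discretized into rot_bins bins of 360/rot_bins degrees
    euler = np.argmax(rot_logits, axis=1) * (360.0 / rot_bins)
    return xyz, euler
```

Detecting an action as a voxel classification problem (rather than regressing continuous poses from 2D images) is the structural prior the abstract credits for PerAct's sample efficiency.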

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Robot Manipulation | RLBench | Inference Speed (fps) | 4.9 | PerAct (Evaluated in RVT) |
| Robot Manipulation | RLBench | Input Image Size | 128 | PerAct (Evaluated in RVT) |
| Robot Manipulation | RLBench | Succ. Rate (18 tasks, 100 demo/task) | 49.4 | PerAct (Evaluated in RVT) |
| Robot Manipulation | RLBench | Training Time (V100 × 8 × day) | 16 | PerAct (Evaluated in RVT) |
| Robot Manipulation | RLBench | Input Image Size | 128 | PerAct |
| Robot Manipulation | RLBench | Succ. Rate (18 tasks, 10 demo/task) | 30 | PerAct |
| Robot Manipulation | RLBench | Succ. Rate (18 tasks, 100 demo/task) | 42.7 | PerAct |
| Robot Manipulation | RLBench | Training Time (V100 × 8 × day) | 16 | PerAct |
| Robot Manipulation | RLBench | Input Image Size | 128 | Image-BC ViT |
| Robot Manipulation | RLBench | Succ. Rate (18 tasks, 100 demo/task) | 1.3 | Image-BC ViT |
| Robot Manipulation | RLBench | Input Image Size | 128 | Image-BC CNN |
| Robot Manipulation | RLBench | Succ. Rate (18 tasks, 100 demo/task) | 1.3 | Image-BC CNN |
| Robot Manipulation | The COLOSSEUM | Average decrease across all perturbations | -17.3 | PerAct |

Related Papers

- DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge (2025-07-06)
- Geometry-aware 4D Video Generation for Robot Manipulation (2025-07-01)
- CapsDT: Diffusion-Transformer for Capsule Robot Manipulation (2025-06-19)
- Robust Instant Policy: Leveraging Student's t-Regression Model for Robust In-context Imitation Learning of Robot Manipulation (2025-06-18)
- SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning (2025-06-17)
- What Matters in Learning from Large-Scale Datasets for Robot Manipulation (2025-06-16)
- Demonstrating Multi-Suction Item Picking at Scale via Multi-Modal Learning of Pick Success (2025-06-12)
- BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models (2025-06-09)