Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation

Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, Katerina Fragkiadaki

2023-06-30 · Spatial Reasoning · Action Detection · Pose Prediction · Robot Manipulation

Paper · PDF · Code (official)

Abstract

3D perceptual representations are well suited for robot manipulation as they easily encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D feature grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, foregoing 3D inductive biases. In this paper, we introduce Act3D, a manipulation policy transformer that represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand. The model lifts 2D pre-trained features to 3D using sensed depth, and attends to them to compute features for sampled 3D points. It samples 3D point grids in a coarse-to-fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling. In this way, it efficiently computes 3D action maps of high spatial resolution. Act3D sets a new state of the art on RLBench, an established manipulation benchmark, where it achieves a 10% absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks and a 22% absolute improvement, with 3x less compute, over the previous SOTA 3D policy. We quantify the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions in ablative experiments. Code and videos are available on our project website: https://act3d.github.io/.
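Two of the mechanisms the abstract describes are easy to illustrate numerically: lifting a depth map to 3D points with pinhole intrinsics, and a coarse-to-fine search that re-samples a shrinking point grid around the best-scoring point. The sketch below is an illustrative assumption, not the paper's implementation: Act3D scores points with learned relative-position attention, whereas here `score_fn` is an arbitrary placeholder, and the function names are invented for this example.

```python
import numpy as np

def unproject_depth(depth, K):
    # Lift a depth map (H, W) to 3D points in the camera frame
    # using pinhole intrinsics K (3x3).
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)  # (H, W, 3)

def coarse_to_fine_sample(score_fn, center, extent, n=8, rounds=3):
    # Repeatedly sample an n^3 point grid, score every point,
    # then recenter a shrunken grid on the best-scoring point.
    for _ in range(rounds):
        axes = [np.linspace(c - extent / 2, c + extent / 2, n) for c in center]
        grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
        scores = score_fn(grid)          # stand-in for attention-based scoring
        center = grid[np.argmax(scores)]
        extent /= n / 2                  # shrink the search volume each round
    return center
```

With a toy score function (negative distance to a hidden target), three rounds of an 8x8x8 grid localize the target to millimeter-scale precision inside a 2 m workspace while scoring only 3 x 512 points, which is the efficiency argument the abstract makes against dense high-resolution voxel grids.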

Results

Task               | Dataset | Metric                                 | Value | Model
Robot Manipulation | RLBench | Input Image Size                       | 256   | Act3D
Robot Manipulation | RLBench | Succ. Rate (18 tasks, 10 demos/task)   | 48    | Act3D
Robot Manipulation | RLBench | Succ. Rate (18 tasks, 100 demos/task)  | 65    | Act3D
Robot Manipulation | RLBench | Training Time (V100 x 8 x day)         | 5     | Act3D

Related Papers

- MindJourney: Test-Time Scaling with World Models for Spatial Reasoning (2025-07-16)
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
- Warehouse Spatial Question Answering with LLM Agent (2025-07-14)
- ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way (2025-07-11)
- M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning (2025-07-11)
- OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding (2025-07-10)
- Scaling RL to Long Videos (2025-07-10)
- A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding (2025-07-09)