Ivan Kapelyukh, Yifei Ren, Ignacio Alzugaray, Edward Johns
We introduce Dream2Real, a robotics framework that integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. The robot autonomously constructs a 3D representation of the scene, in which objects can be rearranged virtually and an image of each candidate arrangement rendered. A VLM evaluates these renders, so the arrangement that best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without collecting a training dataset of example arrangements. Results on a series of real-world tasks show that the framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.
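The core of the pipeline is a render-and-score loop: each virtual rearrangement is rendered, and the VLM rates how well the render matches the instruction. Below is a minimal Python sketch of that loop under stated assumptions; the `Pose` layout and the `render` and `vlm_score` callables are illustrative placeholders, not the authors' actual API.

```python
from typing import Any, Callable, Sequence, Tuple

# Hypothetical types: a 6-DoF pose (x, y, z, roll, pitch, yaw) for the
# movable object, and an opaque rendered image whose concrete type
# depends on the renderer in use.
Pose = Tuple[float, float, float, float, float, float]
Image = Any


def select_best_arrangement(
    candidate_poses: Sequence[Pose],
    instruction: str,
    render: Callable[[Pose], Image],
    vlm_score: Callable[[Image, str], float],
) -> Pose:
    """Render the scene with the object virtually moved to each candidate
    pose, ask the VLM how well each render satisfies the user instruction,
    and return the highest-scoring pose for real-world pick-and-place."""
    return max(candidate_poses, key=lambda p: vlm_score(render(p), instruction))
```

Because goal evaluation happens entirely on renders, the VLM's 2D knowledge can be applied zero-shot; no dataset of example arrangements is required.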
| Task | Dataset | Metric | Success Rate (%) | Model |
|---|---|---|---|---|
| Object Rearrangement | Open6DOR V2 | 6-DoF | 13.5 | Dream2Real |
| Object Rearrangement | Open6DOR V2 | pos-level0 | 11.0 | Dream2Real |
| Object Rearrangement | Open6DOR V2 | pos-level1 | 17.2 | Dream2Real |
| Object Rearrangement | Open6DOR V2 | rot-level0 | 37.3 | Dream2Real |
| Object Rearrangement | Open6DOR V2 | rot-level1 | 27.6 | Dream2Real |
| Object Rearrangement | Open6DOR V2 | rot-level2 | 26.2 | Dream2Real |