Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models
Ivan Kapelyukh, Yifei Ren, Ignacio Alzugaray, Edward Johns
– arXiv.org Artificial Intelligence
We introduce Dream2Real, a robotics framework that integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. The robot autonomously constructs a 3D representation of the scene, within which objects can be rearranged virtually and an image of the resulting arrangement can be rendered. These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without needing to collect a training dataset of example arrangements. Results on a series of real-world tasks show that this framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.
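To make the render-and-score loop in the abstract concrete, here is a minimal sketch using CLIP as a stand-in VLM scorer. The helpers `render_scene` and `candidate_poses` are hypothetical placeholders for the framework's reconstructed 3D scene renderer and pose sampler; this is an illustration of the idea, not the authors' implementation.

```python
# Sketch of the core idea: virtually rearrange objects, render each
# candidate arrangement, and let a VLM pick the render that best
# matches the language instruction.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def score_arrangement(image: Image.Image, instruction: str) -> float:
    """Cosine similarity between a rendered arrangement and the instruction."""
    image_input = preprocess(image).unsqueeze(0).to(device)
    text_input = clip.tokenize([instruction]).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image_input)
        text_feat = model.encode_text(text_input)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (image_feat @ text_feat.T).item()

def choose_best_pose(instruction, candidate_poses, render_scene):
    # render_scene(pose) is a hypothetical renderer over the robot's
    # 3D scene representation; the best-scoring pose is then executed
    # in the real world with pick-and-place.
    scored = [(score_arrangement(render_scene(pose), instruction), pose)
              for pose in candidate_poses]
    return max(scored, key=lambda sp: sp[0])[1]
```

Scoring renders with an off-the-shelf VLM is what makes the pipeline zero-shot: no arrangement-specific training data is needed, only a way to render candidate scenes and compare them against the instruction.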
Dec-7-2023
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (0.30)
- Natural Language > Large Language Model (1.00)
- Robots (1.00)
- Vision (1.00)