Goto

Collaborating Authors

 tabletop scene


MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

Neural Information Processing Systems

The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce \textbf{MesaTask-10K}, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with \emph{manually crafted layouts} that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a \textbf{Spatial Reasoning Chain} that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present \textbf{MesaTask}, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts.


MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

arXiv.org Artificial Intelligence

The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at https://mesatask.github.io/


V-PRISM: Probabilistic Mapping of Unknown Tabletop Scenes

arXiv.org Artificial Intelligence

The ability to construct concise scene representations from sensor input is central to the field of robotics. This paper addresses the problem of robustly creating a 3D representation of a tabletop scene from a segmented RGB-D image. These representations are then critical for a range of downstream manipulation tasks. Many previous attempts to tackle this problem do not capture accurate uncertainty, which is required to subsequently produce safe motion plans. In this paper, we cast the representation of 3D tabletop scenes as a multi-class classification problem. To tackle this, we introduce V-PRISM, a framework and method for robustly creating probabilistic 3D segmentation maps of tabletop scenes. Our maps contain both occupancy estimates, segmentation information, and principled uncertainty measures. We evaluate the robustness of our method in (1) procedurally generated scenes using open-source object datasets, and (2) real-world tabletop data collected from a depth camera. Our experiments show that our approach outperforms alternative continuous reconstruction approaches that do not explicitly reason about objects in a multi-class formulation.


RGB-Only Reconstruction of Tabletop Scenes for Collision-Free Manipulator Control

arXiv.org Artificial Intelligence

We present a system for collision-free control of a robot manipulator that uses only RGB views of the world. Perceptual input of a tabletop scene is provided by multiple images of an RGB camera (without depth) that is either handheld or mounted on the robot end effector. A NeRF-like process is used to reconstruct the 3D geometry of the scene, from which the Euclidean full signed distance function (ESDF) is computed. A model predictive control algorithm is then used to control the manipulator to reach a desired pose while avoiding obstacles in the ESDF. We show results on a real dataset collected and annotated in our lab.


Indie Video Games Have Finally Embraced the Tabletop Scene

WIRED

Monster Train has one hell of a premise. Harpies, warlocks, and abyssal knights are constantly breaching the walls, and players have to muster their own infernal forces to consign the interlopers back to the pits. In the hands of a gigantic studio like Blizzard or Ubisoft or EA, it'd be easy to imagine Monster Train as a sprawling, open-world adventure. We'd explore every nook and cranny of perdition--following waypoints, climbing watch towers, maxing out talent trees. But Shiny Shoe, the developer behind the game, chose a different direction entirely.