Eisner, Ben
FlowBotHD: History-Aware Diffuser Handling Ambiguities in Articulated Objects Manipulation
Li, Yishu, Leng, Wen Hui, Fang, Yiming, Eisner, Ben, Held, David
We introduce a novel approach for manipulating articulated objects that are visually ambiguous, such as doors that are symmetric or heavily occluded. These ambiguities can cause uncertainty over different possible articulation modes: for instance, when the articulation direction (e.g., push, pull, slide) or location (e.g., left side, right side) of a fully closed door is uncertain, or when distinguishing features like the plane of the door are occluded due to the viewing angle. To tackle these challenges, we propose a history-aware diffusion network that can model multi-modal distributions over articulation modes for articulated objects; our method further uses observation history to distinguish between modes and make stable predictions under occlusions. Experiments and analysis demonstrate that our method achieves state-of-the-art performance on articulated object manipulation and dramatically improves performance for articulated objects containing visual ambiguities. Our project website is available at https://flowbothd.github.io/.
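As a rough illustration of the history-aware diffusion idea, the following is a minimal sketch of DDPM-style sampling of a per-point flow field conditioned on an embedding of past observations. `FlowDenoiser`, the architecture, and the noise schedule are all illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's code) of history-conditioned diffusion
# sampling over per-point articulation flow.
import torch
import torch.nn as nn

class FlowDenoiser(nn.Module):
    """Predicts the noise added to a per-point flow field, conditioned on
    the current point cloud and an embedding of past observations."""
    def __init__(self, d_hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 3 + d_hidden + 1, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 3),
        )

    def forward(self, noisy_flow, points, history_emb, t):
        # Broadcast the global history embedding (shape (1, d)) and the
        # timestep to every point before the per-point MLP.
        n = points.shape[0]
        h = history_emb.expand(n, -1)
        tt = torch.full((n, 1), float(t))
        return self.net(torch.cat([noisy_flow, points, h, tt], dim=-1))

@torch.no_grad()
def sample_flow(denoiser, points, history_emb, n_steps=50):
    """DDPM-style reverse process: start from Gaussian noise and iteratively
    denoise into a (possibly multi-modal) per-point flow prediction."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    flow = torch.randn(points.shape[0], 3)
    for t in reversed(range(n_steps)):
        eps = denoiser(flow, points, history_emb, t)
        mean = (flow - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        flow = mean + torch.sqrt(betas[t]) * torch.randn_like(flow) if t > 0 else mean
    return flow

# Hypothetical usage with an untrained denoiser:
denoiser = FlowDenoiser()
points = torch.randn(512, 3)        # current point cloud observation
history_emb = torch.zeros(1, 128)   # e.g. output of a recurrent history encoder
flow = sample_flow(denoiser, points, history_emb)
```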
Non-rigid Relative Placement through 3D Dense Diffusion
Cai, Eric, Donca, Octavian, Eisner, Ben, Held, David
The task of "relative placement" is to predict the placement of one object in relation to another, e.g. placing a mug onto a mug rack. Through explicit object-centric geometric reasoning, recent methods for relative placement have made tremendous progress towards data-efficient learning for robot manipulation while generalizing to unseen task variations. However, they have yet to represent deformable transformations, despite the ubiquity of non-rigid bodies in real world settings. As a first step towards bridging this gap, we propose ``cross-displacement" - an extension of the principles of relative placement to geometric relationships between deformable objects - and present a novel vision-based method to learn cross-displacement through dense diffusion. To this end, we demonstrate our method's ability to generalize to unseen object instances, out-of-distribution scene configurations, and multimodal goals on multiple highly deformable tasks (both in simulation and in the real world) beyond the scope of prior works. Supplementary information and videos can be found at https://sites.google.com/view/tax3d-corl-2024 .
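To make "cross-displacement" concrete, here is a minimal sketch of the representation under the assumption that it is a dense per-point displacement field between an object's current and goal configurations. The function names are illustrative, and the paper learns this field with a diffusion model rather than computing it from known goals.

```python
# Minimal sketch of the cross-displacement representation; names are
# illustrative, not the paper's API.
import numpy as np

def cross_displacement(action_points, goal_points):
    """Per-point displacement carrying each point of the (possibly deformable)
    action object to its goal position relative to the anchor. Unlike a single
    rigid transform, every point gets its own vector, so non-rigid
    deformations are representable."""
    return goal_points - action_points  # (N, 3)

def apply_displacement(action_points, predicted_disp):
    """A predicted dense displacement field yields the goal configuration
    directly, which a downstream controller can then track."""
    return action_points + predicted_disp
```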
Deep SE(3)-Equivariant Geometric Reasoning for Precise Placement Tasks
Eisner, Ben, Yang, Yi, Davchev, Todor, Vecerik, Mel, Scholz, Jonathan, Held, David
Many robot manipulation tasks can be framed as geometric reasoning tasks, where an agent must be able to precisely manipulate an object into a position that satisfies the task from a set of initial conditions. Often, task success is defined based on the relationship between two objects - for instance, hanging a mug on a rack. In such cases, the solution should be equivariant to the initial position of the objects as well as the agent, and invariant to the pose of the camera. This poses a challenge for learning systems that attempt to solve this task by learning directly from high-dimensional demonstrations: the agent must learn to be both equivariant and precise, which can be challenging without any inductive biases about the problem. In this work, we propose a method for precise relative pose prediction which is provably SE(3)-equivariant, can be learned from only a few demonstrations, and can generalize across variations in a class of objects. We accomplish this by factoring the problem into learning an SE(3)-invariant task-specific representation of the scene and then interpreting this representation with novel geometric reasoning layers which are provably SE(3)-equivariant. We demonstrate that our method can yield substantially more precise predictions in simulated placement tasks than previous methods trained with the same amount of data, and can accurately represent relative placement relationships in data collected from real-world demonstrations. Supplementary information and videos can be found at this URL.
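The equivariance property claimed above can be stated as a concrete test: re-posing the two objects independently must change the predicted relative pose in exactly the corresponding way. The sketch below assumes a hypothetical predictor `f(action_pts, anchor_pts)` returning a 4x4 homogeneous pose; it is a property check one could run against any such model, not the paper's code.

```python
# Minimal sketch of an SE(3)-equivariance property check.
import numpy as np
from scipy.spatial.transform import Rotation

def random_se3():
    T = np.eye(4)
    T[:3, :3] = Rotation.random().as_matrix()
    T[:3, 3] = np.random.randn(3)
    return T

def transform(points, T):
    return points @ T[:3, :3].T + T[:3, 3]

def check_equivariance(f, action_pts, anchor_pts, tol=1e-4):
    """If f predicts the pose placing `action` relative to `anchor`, then
    independently re-posing the two objects by T_a, T_b must change the
    prediction to T_b @ T_pred @ inv(T_a)."""
    T_pred = f(action_pts, anchor_pts)
    T_a, T_b = random_se3(), random_se3()
    T_pred2 = f(transform(action_pts, T_a), transform(anchor_pts, T_b))
    expected = T_b @ T_pred @ np.linalg.inv(T_a)
    return np.allclose(T_pred2, expected, atol=tol)
```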
On Time-Indexing as Inductive Bias in Deep RL for Sequential Manipulation Tasks
Qureshi, M. Nomaan, Eisner, Ben, Held, David
In standard policy learning, a single neural-network-based policy is tasked with learning both of these skills (and learning to switch between them), without any access to structures that explicitly encode the multi-modal nature of the task space. Ideally, policies would be able to emergently learn to decompose tasks at different levels of abstraction, and to factor the task learning into unique skills. One common approach is to jointly learn a set of subskills, as well as a selection function which selects a specific subskill to execute at the current time step [5]. This poses a fundamental bootstrapping issue: as the skills change and improve, the selection function must change and improve as well, which can lead to unstable training. An important observation about many optimal policies for manipulation tasks is that skills tend to be executed in sequence, without backtracking. Therefore, time itself can serve as a useful indicator for skill selection. For instance, while executing a stacking task, it is reasonable to assume that the robot will undertake the 'reach' skill at the start of the task, and subsequently perform the 'stack' skill towards the end of the task. Our intuition is that selecting the skill according to the current time step can serve as a good strategy for skill selection.
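A minimal sketch of this inductive bias, assuming a fixed horizon partitioned evenly among skill heads; the architecture and schedule are illustrative assumptions, not necessarily the paper's.

```python
# Minimal sketch of a time-indexed policy: the active skill head is chosen by
# the current timestep alone, so no selection function is learned jointly
# with the skills.
import torch
import torch.nn as nn

class TimeIndexedPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, n_skills, horizon, hidden=64):
        super().__init__()
        self.horizon = horizon
        self.n_skills = n_skills
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, act_dim))
            for _ in range(n_skills)
        ])

    def forward(self, obs, t):
        # Fixed schedule: partition the horizon evenly among skills, e.g.
        # 'reach' occupies the first segment and 'stack' the last.
        k = min(int(t / self.horizon * self.n_skills), self.n_skills - 1)
        return self.heads[k](obs)
```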
FlowBot++: Learning Generalized Articulated Objects Manipulation via Articulation Projection
Zhang, Harry, Eisner, Ben, Held, David
Understanding and manipulating articulated objects, such as doors and drawers, is crucial for robots operating in human environments. We wish to develop a system that can learn to articulate novel objects with no prior interaction, after training on other articulated objects. Previous approaches to articulated object manipulation rely either on modular methods, which are brittle, or on end-to-end methods, which lack generalizability. This paper presents FlowBot++, a deep 3D vision-based robotic system that predicts dense per-point motion and dense articulation parameters of articulated objects to assist in downstream manipulation tasks. FlowBot++ introduces a novel per-point representation of the articulated motion and articulation parameters that are combined to produce a more accurate estimate than either representation on its own. Simulated experiments on the PartNet-Mobility dataset validate the performance of our system in articulating a wide range of objects, while real-world experiments on real objects' point clouds and a Sawyer robot demonstrate the generalizability and feasibility of our system in real-world scenarios.
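One way to picture the combination of per-point flow with articulation parameters: for a revolute joint with unit axis w through origin o, the instantaneous motion of a point p is w x (p - o), so an estimated articulation model induces an analytic flow field that can constrain the raw network prediction. The fusion rule below is an illustrative assumption, not the paper's exact Articulation Projection scheme.

```python
# Minimal sketch of fusing a raw per-point flow prediction with predicted
# articulation parameters for a revolute joint.
import numpy as np

def analytic_flow(points, axis, origin):
    """Instantaneous motion of points on a revolute joint: w x (p - o)."""
    axis = axis / np.linalg.norm(axis)
    return np.cross(axis, points - origin)

def fuse(raw_flow, points, axis, origin):
    ref = analytic_flow(points, axis, origin)
    ref_dir = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + 1e-9)
    # Keep the magnitude information from the raw flow, but constrain its
    # direction to be consistent with the estimated articulation model.
    mags = np.sum(raw_flow * ref_dir, axis=1, keepdims=True)
    return mags * ref_dir
```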
TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation
Pan, Chuer, Okorn, Brian, Zhang, Harry, Eisner, Ben, Held, David
How do we imbue robots with the ability to efficiently manipulate unseen objects and transfer relevant skills based on demonstrations? End-to-end learning methods often fail to generalize to novel objects or unseen configurations. Instead, we focus on the task-specific pose relationship between relevant parts of interacting objects. We conjecture that this relationship is a generalizable notion of a manipulation task that can transfer to new objects in the same category; examples include the relationship between the pose of a pan relative to an oven or the pose of a mug relative to a mug rack. We call this task-specific pose relationship "cross-pose" and provide a mathematical definition of this concept. We propose a vision-based system that learns to estimate the cross-pose between two objects for a given manipulation task using learned cross-object correspondences. The estimated cross-pose is then used to guide a downstream motion planner to manipulate the objects into the desired pose relationship (placing a pan into the oven or the mug onto the mug rack). We demonstrate our method's capability to generalize to unseen objects, in some cases after training on only 10 demonstrations in the real world. Results show that our system achieves state-of-the-art performance in both simulated and real-world experiments across a number of tasks. Supplementary information and videos can be found at https://sites.google.com/view/tax-pose/home.
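Given learned cross-object correspondences, a rigid cross-pose can be recovered in closed form with a weighted least-squares fit (Kabsch/Procrustes), sketched below. The correspondence points and weights would come from the learned model; this is a generic version of such a step, not the paper's exact code.

```python
# Minimal sketch of recovering a rigid cross-pose from weighted per-point
# correspondences.
import numpy as np

def fit_cross_pose(src, dst, weights):
    """Rigid transform (R, t) minimizing sum_i w_i ||R src_i + t - dst_i||^2."""
    w = weights / weights.sum()
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_d = (w[:, None] * dst).sum(axis=0)
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    # Reflection guard keeps R a proper rotation (det(R) = +1).
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```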
FlowBot3D: Learning 3D Articulation Flow to Manipulate Articulated Objects
Eisner, Ben, Zhang, Harry, Held, David
We propose a vision-based system that learns to predict the potential motions of the parts of a variety of articulated objects to guide downstream motion planning of the system to articulate the objects. To predict the object motions, we train a neural network to output a dense vector field representing the point-wise motion direction of the points in the point cloud under articulation. We then deploy an analytical motion planner based on this vector field to achieve a policy that yields maximum articulation. We train a single vision model entirely in simulation across all categories of objects, and we demonstrate the capability of our system to generalize to unseen object instances and novel categories in both simulation and the real world using the trained model for all categories, deploying our policy on a Sawyer robot with no finetuning. Results show that our system achieves state-of-the-art performance.
[Figure 1: FlowBot3D in action. The system first observes the initial configuration of the object of interest, estimates the per-point articulation flow of the point cloud (3DAF), then executes the action based on the selected flow vector. Here, the red vectors represent the direction of flow of each point (object points appear in blue); the magnitude of the vector corresponds to the relative magnitude of the motion that point experiences as the object articulates.]
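The action-selection step described above is simple enough to write out: choose the point with the largest predicted flow magnitude and move along its flow direction, since that point moves most under articulation. The interface is illustrative; gripper control is omitted.

```python
# Minimal sketch of flow-based action selection.
import numpy as np

def select_action(points, flow):
    """points, flow: (N, 3) arrays. Returns a grasp point and a unit motion
    direction; pulling at the max-magnitude flow point maximizes articulation
    per unit of gripper motion."""
    mags = np.linalg.norm(flow, axis=1)
    i = int(np.argmax(mags))
    direction = flow[i] / (mags[i] + 1e-9)
    return points[i], direction
```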
QXplore: Q-learning Exploration by Maximizing Temporal Difference Error
Simmons-Edler, Riley, Eisner, Ben, Mitchell, Eric, Seung, Sebastian, Lee, Daniel
A major challenge in reinforcement learning for continuous state-action spaces is exploration, especially when reward landscapes are very sparse. Several recent methods provide an intrinsic motivation to explore by directly encouraging RL agents to seek novel states. A potential disadvantage of pure state-novelty-seeking behavior is that unknown states are treated equally regardless of their potential for future reward. In this paper, we propose that the temporal difference error of predicting primary reward can serve as a secondary reward signal for exploration. This leads to novelty-seeking in the absence of primary reward, and at the same time accelerates exploration of reward-rich regions in sparse (but nonzero) reward landscapes compared to state novelty-seeking. This objective draws inspiration from dopaminergic pathways in the brain that influence animal behavior. We implement this idea with an adversarial method in which Q and Qx are the action-value functions for primary and secondary rewards, respectively. Secondary reward is given by the absolute value of the TD-error of Q. Training is off-policy, based on a replay buffer containing a mixture of trajectories induced by Q and Qx. We characterize performance on a suite of continuous control benchmark tasks against recent state-of-the-art exploration methods and demonstrate comparable or better performance on all tasks, with much faster convergence for Q.
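The secondary reward is stated precisely enough to sketch: the absolute TD-error of the primary critic Q becomes the reward maximized by the exploration critic Qx. The snippet below shows it for discrete actions for brevity (the paper targets continuous control, where the max over actions would be replaced by a learned policy or sampler); the critic interface is an assumption.

```python
# Minimal sketch of TD-error as a secondary exploration reward.
import torch

def secondary_reward(q_net, s, a, r, s_next, done, gamma=0.99):
    """r_x = | r + gamma * max_a' Q(s', a') - Q(s, a) |, computed without
    gradients so it acts purely as a reward signal for Qx."""
    with torch.no_grad():
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
        return (target - q_sa).abs()
```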
Q-Learning for Continuous Actions with Cross-Entropy Guided Policies
Simmons-Edler, Riley, Eisner, Ben, Mitchell, Eric, Seung, Sebastian, Lee, Daniel
Off-policy reinforcement learning (RL) is an important class of methods for many problem domains, such as robotics, where the cost of collecting data is high and on-policy methods are consequently intractable. Standard methods for applying Q-learning to continuous-valued action domains involve iteratively sampling the Q-function to find a good action (e.g. via hill-climbing), or learning a policy network at the same time as the Q-function (e.g. DDPG). Both approaches make tradeoffs between stability, speed, and accuracy. We propose a novel approach, called Cross-Entropy Guided Policies, or CGP, that draws inspiration from both classes of techniques. CGP aims to combine the stability and performance of iterative sampling policies with the low computational cost of a policy network. Our approach trains the Q-function using iterative sampling with the Cross-Entropy Method (CEM), while training a policy network to imitate CEM's sampling behavior. We demonstrate that our method is more stable to train than state-of-the-art policy-network methods, while preserving equivalent inference-time compute costs, and achieving competitive total reward on standard benchmarks.
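A minimal sketch of the two components CGP couples: a Cross-Entropy Method search over actions against the Q-function, and a policy network regressed onto the CEM output so inference needs only one forward pass. Population sizes, action clamping, and network signatures are illustrative assumptions.

```python
# Minimal sketch of CEM search over a Q-function plus policy imitation.
import torch

def cem_action(q_net, s, act_dim, iters=4, pop=64, elite=8):
    """Iteratively refit a Gaussian over actions toward high-Q samples.
    s: a single state of shape (1, obs_dim); actions assumed in [-1, 1]."""
    mu = torch.zeros(act_dim)
    std = torch.ones(act_dim)
    for _ in range(iters):
        acts = (mu + std * torch.randn(pop, act_dim)).clamp(-1, 1)
        q = q_net(s.expand(pop, -1), acts).squeeze(-1)
        elites = acts[q.topk(elite).indices]
        mu, std = elites.mean(dim=0), elites.std(dim=0) + 1e-3
    return mu

def policy_imitation_loss(policy, q_net, s, act_dim):
    """Train the policy to output CEM's action, so at test time a single
    forward pass replaces the expensive sampling loop."""
    with torch.no_grad():
        target = cem_action(q_net, s, act_dim)
    return ((policy(s) - target) ** 2).mean()
```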