
 Stone, Peter


Learning to Look: Seeking Information for Decision Making via Policy Factorization

arXiv.org Artificial Intelligence

Intelligent decisions can only be made based on the right information. When operating in the environment, an intelligent agent actively seeks the information that enables it to select the right actions and proceeds with the task only when it is confident enough. For example, when following a video recipe, a chef would look at the TV to obtain information about the next ingredient to grasp, and later look at a timer to decide when to turn off the stove. In contrast, current learning robots assume that the information needed for manipulation is readily available in their sensor signals (e.g., from a stationary camera looking at a tabletop manipulation setting), or rely on a low-dimensional state representation predefined by a human (e.g., object pose), who must also provide the means for the robot to perceive it. In this work, our goal is to endow robots with the capability to learn information-seeking actions that reveal the information needed for manipulation, using the quality of the informed actions as supervision, and switching between active perception and manipulation based only on the uncertainty about which manipulation action should come next. Performing actions to reveal information has been previously explored in the subfields of active and interactive perception. In active perception [1, 2, 3], an agent changes the parameters of its sensors (e.g., camera pose [4, 5, 6] or parameters [7, 8, 9]) to infer information such as object pose, shape, or material. Interactive perception [10] solutions go one step further and enable agents to change the state of the environment to create information-rich signals to perceive kinematics [11, 12], material [13], or other properties [14, 15, 16, 17].
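A minimal sketch of the switching idea described above, not the paper's implementation: a manipulation policy reports an uncertainty estimate (here, disagreement across a toy ensemble), and the agent takes an information-seeking action whenever that uncertainty exceeds a threshold. The policies, the ensemble, and the threshold are illustrative assumptions.

```python
# Hedged sketch: uncertainty-gated switching between an information-seeking
# (perception) action and a manipulation action. All components are toy stand-ins.
import numpy as np

class EnsembleManipulationPolicy:
    """Toy ensemble of linear action heads; disagreement serves as uncertainty."""
    def __init__(self, obs_dim, act_dim, n_members=5, seed=0):
        rng = np.random.default_rng(seed)
        self.heads = [rng.normal(size=(act_dim, obs_dim)) * 0.1 for _ in range(n_members)]

    def act_with_uncertainty(self, obs):
        actions = np.stack([W @ obs for W in self.heads])
        mean_action = actions.mean(axis=0)
        uncertainty = actions.std(axis=0).mean()   # ensemble disagreement
        return mean_action, uncertainty

def perception_action(obs):
    """Placeholder information-seeking action, e.g., a gaze or camera motion."""
    return np.array([0.0, 0.1])

def step_policy(obs, manip_policy, uncertainty_threshold=0.5):
    action, unc = manip_policy.act_with_uncertainty(obs)
    if unc > uncertainty_threshold:
        return "perceive", perception_action(obs)   # seek more information first
    return "manipulate", action                      # confident enough to act

if __name__ == "__main__":
    policy = EnsembleManipulationPolicy(obs_dim=4, act_dim=2)
    mode, action = step_policy(np.ones(4), policy)
    print(mode, action)
```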


Disentangled Unsupervised Skill Discovery for Efficient Hierarchical Reinforcement Learning

arXiv.org Artificial Intelligence

A hallmark of intelligent agents is the ability to learn reusable skills purely from unsupervised interaction with the environment. However, existing unsupervised skill discovery methods often learn entangled skills where one skill variable simultaneously influences many entities in the environment, making downstream skill chaining extremely challenging. We propose Disentangled Unsupervised Skill Discovery (DUSDi), a method for learning disentangled skills that can be efficiently reused to solve downstream tasks. DUSDi decomposes skills into disentangled components, where each skill component only affects one factor of the state space. Importantly, these skill components can be concurrently composed to generate low-level actions, and efficiently chained to tackle downstream tasks through hierarchical Reinforcement Learning. DUSDi defines a novel mutual-information-based objective to enforce disentanglement between the influences of different skill components, and utilizes value factorization to optimize this objective efficiently. Evaluated in a set of challenging environments, DUSDi successfully learns disentangled skills, and significantly outperforms previous skill discovery methods when it comes to applying the learned skills to solve downstream tasks. Code and skills visualization at jiahenghu.github.io/DUSDi-site/.
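A hedged sketch of the kind of mutual-information-based disentanglement reward the abstract describes, not the authors' code: each skill component z_i is rewarded for being predictable from its own state factor s_i and penalized for being predictable from other factors, using variational lower bounds from per-factor discriminators. The discriminator outputs and the exact combination are illustrative assumptions.

```python
# Toy DUSDi-style per-component reward from pre-computed discriminator log-probs.
import numpy as np

def dusdi_reward(logq_own, logq_other, prior_logp):
    """
    logq_own:   log q_i(z_i | s_i), shape (n_components,)
    logq_other: log q_i(z_i | s_j) for j != i, shape (n_components, n_components - 1)
    prior_logp: log p(z_i) under the skill prior, shape (n_components,)
    Returns one pseudo-reward per skill component.
    """
    gain = logq_own - prior_logp                        # encourages I(z_i; s_i)
    leak = (logq_other - prior_logp[:, None]).mean(1)   # penalizes I(z_i; s_j), j != i
    return gain - leak

if __name__ == "__main__":
    n = 3
    r = dusdi_reward(
        logq_own=np.array([-0.2, -0.5, -0.1]),
        logq_other=np.full((n, n - 1), -1.5),
        prior_logp=np.full(n, -np.log(4)),   # e.g., uniform prior over 4 discrete skills
    )
    print(r)  # per-component rewards, later combined via value factorization
```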


SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Recent advances in CV and NLP have been largely driven by scaling up the number of network parameters, despite traditional theories suggesting that larger networks are prone to overfitting. These large networks avoid overfitting by integrating components that induce a simplicity bias, guiding models toward simple and generalizable solutions. However, in deep RL, designing and scaling up networks have been less explored. Motivated by this opportunity, we present SimBa, an architecture designed to scale up parameters in deep RL by injecting a simplicity bias. SimBa consists of three components: (i) an observation normalization layer that standardizes inputs with running statistics, (ii) a residual feedforward block to provide a linear pathway from the input to output, and (iii) a layer normalization to control feature magnitudes. By scaling up parameters with SimBa, the sample efficiency of various deep RL algorithms, including off-policy, on-policy, and unsupervised methods, is consistently improved. Moreover, integrating the SimBa architecture into SAC alone matches or surpasses state-of-the-art deep RL methods with high computational efficiency across DMC, MyoSuite, and HumanoidBench. These results demonstrate SimBa's broad applicability and effectiveness across diverse RL algorithms and environments.
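A rough PyTorch sketch of the three components listed above (running-statistics observation normalization, a residual feedforward block providing a skip pathway, and a final layer norm). Layer sizes, block count, and the exact block structure are assumptions, not the authors' released architecture.

```python
# Illustrative SimBa-style encoder assembled from the three described components.
import torch
import torch.nn as nn

class RunningObsNorm(nn.Module):
    """Standardizes observations with running mean/variance statistics."""
    def __init__(self, obs_dim, eps=1e-5):
        super().__init__()
        self.register_buffer("mean", torch.zeros(obs_dim))
        self.register_buffer("var", torch.ones(obs_dim))
        self.register_buffer("count", torch.tensor(eps))
        self.eps = eps

    def forward(self, x):
        if self.training:
            batch_mean, batch_var = x.mean(0), x.var(0, unbiased=False)
            n = x.shape[0]
            delta = batch_mean - self.mean
            total = self.count + n
            self.mean += delta * n / total
            self.var = (self.var * self.count + batch_var * n +
                        delta ** 2 * self.count * n / total) / total
            self.count = total
        return (x - self.mean) / torch.sqrt(self.var + self.eps)

class ResidualFFBlock(nn.Module):
    """Pre-norm MLP block with a skip connection (the linear input-to-output pathway)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return x + self.ff(self.norm(x))

class SimBaEncoder(nn.Module):
    def __init__(self, obs_dim, dim=256, hidden=1024, n_blocks=2):
        super().__init__()
        self.obs_norm = RunningObsNorm(obs_dim)
        self.proj = nn.Linear(obs_dim, dim)
        self.blocks = nn.ModuleList(ResidualFFBlock(dim, hidden) for _ in range(n_blocks))
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, obs):
        x = self.proj(self.obs_norm(obs))
        for block in self.blocks:
            x = block(x)
        return self.final_norm(x)
```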


Effort Allocation for Deadline-Aware Task and Motion Planning: A Metareasoning Approach

arXiv.org Artificial Intelligence

Their approach involved modeling the problem using a set of processes, each dedicated to searching for a plan, akin to representing search nodes on an open list. Each process is characterized by a probabilistic performance profile: a random variable giving the probability of successful termination as a function of processing time, together with a random variable modeling the deadline of the corresponding partial plan, which is revealed only after planning concludes. The meta-level problem lies in finding an optimal schedule of processing time across all processes that maximizes the probability that some process delivers a plan before its deadline. A simplified version of this problem, known as "simplified allocating planning effort when actions expire," assumes discrete time intervals and has been proven to be NP-hard. However, under the condition of known deadlines, the problem becomes solvable in pseudo-polynomial time through dynamic programming. Later, this line of work was extended to consider interleaved planning and execution, where partial plans can be executed during the search [62, 63]. While this body of work bears relevance to our research, it primarily concentrates on deriving symbolic plans. In contrast, our focus lies in elaborating existing symbolic plans through motion-level reasoning to make them executable for a robot, optimizing the likelihood of meeting a pre-specified deadline.
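A hedged toy sketch of the known-deadline, discrete-time case mentioned above: processes are considered in deadline order, and a dynamic program chooses how long to run each one to maximize the probability that at least one returns a plan before its own deadline. The success-probability tables and the sequential-processing assumption are illustrative, not the exact formulation in the cited work.

```python
# Pseudo-polynomial dynamic program over (process index, elapsed time).
from functools import lru_cache

def best_allocation(success_prob, deadlines):
    """
    success_prob[i][a]: probability process i succeeds if given `a` time units.
    deadlines[i]: latest time by which process i must finish (sorted ascending).
    Returns the maximum probability that some process succeeds in time.
    """
    n = len(deadlines)

    @lru_cache(maxsize=None)
    def f(i, elapsed):
        if i == n:
            return 0.0
        best = f(i + 1, elapsed)  # option: allocate no time to process i
        for a in range(1, max(0, deadlines[i] - elapsed) + 1):
            p = success_prob[i][min(a, len(success_prob[i]) - 1)]
            best = max(best, p + (1.0 - p) * f(i + 1, elapsed + a))
        return best

    return f(0, 0)

if __name__ == "__main__":
    probs = [
        [0.0, 0.2, 0.5, 0.6],   # process 0: success probability after 0..3 time units
        [0.0, 0.1, 0.3, 0.7],   # process 1
    ]
    print(best_allocation(probs, deadlines=[3, 5]))
```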


Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory

arXiv.org Artificial Intelligence

In order to train agents that can quickly adapt to new objectives or reward functions, efficient unsupervised representation learning in sequential decision-making environments can be important. Frameworks such as the Exogenous Block Markov Decision Process (Ex-BMDP) have been proposed to formalize this representation-learning problem (Efroni et al., 2022b). In the Ex-BMDP framework, the agent's high-dimensional observations of the environment have two latent factors: a controllable factor, which evolves deterministically within a small state space according to the agent's actions, and an exogenous factor, which represents time-correlated noise and can be highly complex. The goal of the representation learning problem is to learn an encoder that maps from observations into the controllable latent space, as well as the dynamics of this space. Efroni et al. (2022b) showed that this is possible with a sample complexity that depends only on the size of the controllable latent space, and not on the size of the noise factor. However, this prior work has focused on the episodic setting, where the controllable latent state resets to a specific start state after a finite horizon. By contrast, if the agent can only interact with the environment in a single continuous trajectory, no prior work has established sample-complexity bounds. We propose STEEL, the first provably sample-efficient algorithm for learning the controllable dynamics of an Ex-BMDP from a single trajectory, in the function approximation setting. STEEL has a sample complexity that depends only on the sizes of the controllable latent space and the encoder function class, and (at worst linearly) on the mixing time of the exogenous noise factor. We prove that STEEL is correct and sample-efficient, and demonstrate STEEL on two toy problems. Code is available at: https://github.com/midi-lab/steel.
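An illustrative toy Ex-BMDP, not taken from the paper, showing the structure the abstract describes: the observation is emitted from a small controllable latent state that evolves deterministically under actions, plus an exogenous Markov-chain noise factor; the agent sees one long trajectory with no resets. All sizes and distributions are assumptions for illustration.

```python
# Toy Ex-BMDP generative environment: obs = emission(controllable, exogenous) + noise.
import numpy as np

class ToyExBMDP:
    def __init__(self, n_ctrl=4, n_exo=3, obs_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.ctrl_dynamics = rng.integers(0, n_ctrl, size=(n_ctrl, 2))   # deterministic T(s, a)
        self.exo_transition = rng.dirichlet(np.ones(n_exo), size=n_exo)  # exogenous Markov chain
        self.emission = rng.normal(size=(n_ctrl, n_exo, obs_dim))        # observation templates
        self.rng = rng
        self.s, self.e = 0, 0

    def reset(self):
        self.s, self.e = 0, 0
        return self._obs()

    def step(self, action):
        self.s = self.ctrl_dynamics[self.s, action]    # controllable factor: deterministic
        self.e = self.rng.choice(len(self.exo_transition), p=self.exo_transition[self.e])
        return self._obs()

    def _obs(self):
        return self.emission[self.s, self.e] + 0.01 * self.rng.normal(size=self.emission.shape[-1])

if __name__ == "__main__":
    env = ToyExBMDP()
    obs = env.reset()
    for a in [0, 1, 1, 0]:        # one continuous trajectory; no episode boundaries
        obs = env.step(a)
    print(obs.shape)
```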


Fine-Grained Gradient Restriction: A Simple Approach for Mitigating Catastrophic Forgetting

arXiv.org Artificial Intelligence

A fundamental challenge in continual learning is to balance the trade-off between learning new tasks and remembering the previously acquired knowledge. Gradient Episodic Memory (GEM) achieves this balance by utilizing a subset of past training samples to restrict the update direction of the model parameters. In this work, we start by analyzing an often-overlooked hyper-parameter in GEM, the memory strength, which boosts the empirical performance by further constraining the update direction. We show that memory strength is effective mainly because it improves GEM's generalization ability and therefore leads to a more favorable trade-off. Based on this finding, we propose two approaches that more flexibly constrain the update direction. Our methods achieve uniformly better Pareto frontiers between remembering old knowledge and learning new knowledge than tuning the memory strength. We further propose a computationally efficient method to approximately solve the optimization problem with more constraints.
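A minimal sketch of GEM-style gradient projection with a memory-strength margin, simplified to a single memory gradient so the projection has a closed form (the full method handles multiple past-task constraints via a quadratic program). The specific form of the margin is an assumption for illustration.

```python
# Project the current-task gradient so the update does not increase the memory loss.
import numpy as np

def project_gradient(g, g_mem, memory_strength=0.0):
    """
    Smallest change to gradient `g` such that <g_proj, g_mem> >= margin.
    With memory_strength = 0 this is the plain GEM constraint; larger values
    constrain the update direction more, which the paper links to better generalization.
    """
    margin = memory_strength * g_mem.dot(g_mem)   # assumed margin parameterization
    violation = margin - g.dot(g_mem)
    if violation <= 0:
        return g                                  # constraint already satisfied
    return g + (violation / g_mem.dot(g_mem)) * g_mem

if __name__ == "__main__":
    g = np.array([1.0, -2.0])                     # gradient on the new task
    g_mem = np.array([0.0, 1.0])                  # gradient on memory samples
    print(project_gradient(g, g_mem, memory_strength=0.1))  # -> [1.  0.1]
```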


FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning

arXiv.org Artificial Intelligence

In recent years, the Robotics field has initiated several efforts toward building generalist robot policies through large-scale multi-task Behavior Cloning. However, direct deployment of these policies has led to unsatisfactory performance, where the policy struggles with unseen states and tasks. How can we break through the performance plateau of these models and elevate their capabilities to new heights? In this paper, we propose FLaRe, a large-scale Reinforcement Learning fine-tuning framework that integrates robust pre-trained representations, large-scale training, and gradient stabilization techniques. Our method aligns pre-trained policies towards task completion, achieving state-of-the-art (SoTA) performance both on previously demonstrated and on entirely novel tasks and embodiments. Specifically, on a set of long-horizon mobile manipulation tasks, FLaRe achieves an average success rate of 79.5% in unseen environments, with absolute improvements of +23.6% in simulation and +30.7% on real robots over prior SoTA methods. By utilizing only sparse rewards, our approach enables generalization to new capabilities beyond the pretraining data with minimal human effort. Moreover, we demonstrate rapid adaptation to new embodiments and behaviors with less than a day of fine-tuning. Videos can be found on the project website at https://robot-flare.github.io/


Grounded Curriculum Learning

arXiv.org Artificial Intelligence

The high cost of real-world data for robotics Reinforcement Learning (RL) leads to the wide usage of simulators. Despite extensive work on building better dynamics models for simulators to match the real world, there is another, often-overlooked mismatch between simulations and the real world, namely the distribution of available training tasks. Such a mismatch is further exacerbated by existing curriculum learning techniques, which automatically vary the simulation task distribution without considering its relevance to the real world. Considering these challenges, we posit that curriculum learning for robotics RL needs to be grounded in real-world task distributions. To this end, we propose Grounded Curriculum Learning (GCL), which aligns the simulated task distribution in the curriculum with the real world, while explicitly considering which tasks have been given to the robot and how the robot has performed in the past. We validate GCL using the BARN dataset on complex navigation tasks, achieving a 6.8% and 6.5% higher success rate compared to a state-of-the-art CL method and a curriculum designed by human experts, respectively. These results show that GCL can enhance learning efficiency and navigation performance by grounding the simulation task distribution in the real world within an adaptive curriculum.
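A hedged toy sketch of the grounding idea, not the authors' method: curriculum tasks are sampled near a distribution fit to real-world task parameters, with a difficulty shift driven by recent robot performance. The parameterization and update rule are illustrative assumptions.

```python
# Toy curriculum anchored to a real-world task distribution.
import numpy as np

class GroundedCurriculum:
    def __init__(self, real_task_params, difficulty_direction, rng=None):
        self.mu = real_task_params.mean(axis=0)          # ground in real-world tasks
        self.sigma = real_task_params.std(axis=0) + 1e-6
        self.direction = difficulty_direction            # axis along which tasks get harder
        self.offset = 0.0
        self.rng = rng or np.random.default_rng(0)

    def sample_task(self):
        base = self.rng.normal(self.mu, self.sigma)
        return base + self.offset * self.direction

    def update(self, success_rate, target=0.7, step=0.1):
        # Push tasks harder when the robot succeeds too often, easier otherwise,
        # while staying anchored to the real-world distribution.
        self.offset += step * (success_rate - target)

if __name__ == "__main__":
    real_tasks = np.array([[0.5, 1.0], [0.7, 1.2], [0.6, 0.9]])  # e.g., obstacle density, gap width
    cur = GroundedCurriculum(real_tasks, difficulty_direction=np.array([1.0, -0.5]))
    task = cur.sample_task()
    cur.update(success_rate=0.9)
    print(task, cur.offset)
```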


PRESTO: Fast motion planning using diffusion models based on key-configuration environment representation

arXiv.org Artificial Intelligence

We introduce a learning-guided motion planning framework that provides initial seed trajectories for trajectory optimization using a diffusion model. Given a workspace, our method approximates the configuration space (C-space) obstacles through a key-configuration representation that consists of a sparse set of task-related key configurations, and uses this as an input to the diffusion model. The diffusion model is trained with regularization terms that encourage collision avoidance and smooth trajectories, and trajectory optimization refines the generated seed trajectories to further correct any colliding segments. Our experimental results demonstrate that using high-quality trajectory priors, learned through our C-space-grounded diffusion model, enables efficient generation of collision-free trajectories in narrow-passage environments, outperforming prior learning- and planning-based baselines. Videos and additional materials can be found on the project page: https://kiwi-sherbet.github.io/PRESTO.
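A rough sketch of the two-stage pipeline described above: a placeholder stands in for the learned diffusion model, proposing a seed trajectory conditioned on key configurations, and simple gradient-based trajectory optimization then reduces collision and smoothness costs. The sampler, cost terms, and parameters are illustrative, not PRESTO's implementation.

```python
# Seed-then-optimize pipeline with toy collision and smoothness costs.
import numpy as np

def sample_seed_trajectory(key_configs, n_waypoints=20):
    """Stand-in for the learned diffusion model: interpolate through key configurations."""
    idx = np.linspace(0, len(key_configs) - 1, n_waypoints)
    lo, hi = np.floor(idx).astype(int), np.ceil(idx).astype(int)
    frac = (idx - lo)[:, None]
    return (1 - frac) * key_configs[lo] + frac * key_configs[hi]

def collision_grad(traj, obstacle, radius=0.3):
    """Gradient pushing waypoints out of a single circular C-space obstacle."""
    diff = traj - obstacle
    dist = np.linalg.norm(diff, axis=1, keepdims=True)
    inside = (dist < radius).astype(float)
    return -inside * diff / np.maximum(dist, 1e-6)

def optimize(traj, obstacle, iters=100, lr=0.05, smooth_w=1.0):
    traj = traj.copy()
    for _ in range(iters):
        smooth = np.zeros_like(traj)
        smooth[1:-1] = 2 * traj[1:-1] - traj[:-2] - traj[2:]   # discrete acceleration term
        grad = smooth_w * smooth + collision_grad(traj, obstacle)
        traj[1:-1] -= lr * grad[1:-1]                          # keep endpoints fixed
    return traj

if __name__ == "__main__":
    key_configs = np.array([[0.0, 0.0], [0.5, 0.6], [1.0, 1.0]])
    seed = sample_seed_trajectory(key_configs)
    refined = optimize(seed, obstacle=np.array([0.5, 0.5]))
    print(refined.shape)
```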


Autonomous Ground Navigation in Highly Constrained Spaces: Lessons learned from The 3rd BARN Challenge at ICRA 2024

arXiv.org Artificial Intelligence

The 3rd BARN (Benchmark Autonomous Robot Navigation) Challenge took place at the 2024 IEEE International Conference on Robotics and Automation (ICRA 2024) in Yokohama, Japan, and continued to evaluate the performance of state-of-the-art autonomous ground navigation systems in highly constrained environments. Similar to the trend in the 1st and 2nd BARN Challenges at ICRA 2022 and 2023 in Philadelphia (North America) and London (Europe), the 3rd BARN Challenge in Yokohama (Asia) became more regional, i.e., mostly Asian teams participated. The competition also shrank slightly (six simulation teams, four of which were invited to the physical competition). Compared to the last two years, the competition results suggest that the field has adopted new machine learning approaches while converging slightly toward a few common practices. However, the regional makeup of the physical participants highlights the need to promote wider participation worldwide and to provide more resources for travel to the venue. In this article, we discuss the challenge, the approaches used by the three winning teams, and lessons learned to direct future research and competitions.