 meta-rl


Doubly Robust Augmented Transfer for Meta-Reinforcement Learning

Neural Information Processing Systems

... RL problems through the idea of "learning to learn". Current meta-RL methods can be classified into two categories. ... These methods mainly differ in their ways of inference [3, 4, 20]. The other line follows the technique of relabeling that enables sample reuse across tasks, i.e., learning a task ... Packer et al. apply hindsight relabeling for meta-RL, and propose hindsight task relabeling (HTR) to relabel the trajectories ... Taking a step further than hindsight relabeling, Wan et al. additionally introduce foresight ... Huang et al. derive a general form of policy gradient from the DR value estimator [29], whereas a DR off-policy actor-critic ... Kallus et al. propose the doubly robust method to find a robust policy that can ... Depending on the knowledge to be transferred, these methods in RL can be roughly divided into classes including sampled transitions [32, 33], learned policies or value networks [34, 35, 36, 37], features [38, 39, 40], and skills [41, 42]. ... Doubly Robust Property for Direct Use of Doubly Robust Estimator: We show the doubly robust property of the DR estimator for the value function in Eq. (5) in the main text, as follows. ...
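The snippet refers to a doubly robust (DR) value estimator whose equation is not reproduced in this listing. As a rough illustration only, the sketch below uses the standard step-wise DR recursion from the off-policy evaluation literature, which may differ from the paper's Eq. (5). The function name, the toy two-step chain, and the `q_hat`/`v_hat` callables are all assumptions made for this example.

```python
def doubly_robust_value(trajectory, q_hat, v_hat, gamma=0.99):
    """Backward-recursive doubly robust value estimate for one trajectory.

    trajectory: list of (state, action, reward, rho), where rho is the
                importance ratio pi(a|s) / mu(a|s) for that step.
    q_hat, v_hat: approximate action-value and state-value models.
    """
    v_dr = 0.0  # estimate beyond the final step is zero
    for state, action, reward, rho in reversed(trajectory):
        # Model prediction plus an importance-weighted correction term.
        v_dr = v_hat(state) + rho * (reward + gamma * v_dr - q_hat(state, action))
    return v_dr


# Toy deterministic chain (hypothetical): 0 -> 1 -> terminal, reward 1 per step.
# With gamma = 1 the true values are V(0) = 2 and V(1) = 1.
exact_v = {0: 2.0, 1: 1.0}

# Case 1: the value model is exact, the importance ratios are wrong.
traj_bad_rho = [(0, "a", 1.0, 5.0), (1, "a", 1.0, 0.2)]
est_model_right = doubly_robust_value(
    traj_bad_rho, lambda s, a: exact_v[s], lambda s: exact_v[s], gamma=1.0
)

# Case 2: the value model is useless (all zeros), the ratios are correct.
traj_good_rho = [(0, "a", 1.0, 1.0), (1, "a", 1.0, 1.0)]
est_rho_right = doubly_robust_value(
    traj_good_rho, lambda s, a: 0.0, lambda s: 0.0, gamma=1.0
)
```

In this toy case both estimates recover the true value of the start state. This is the sense in which the estimator is "doubly robust": it stays accurate when either the value model or the importance ratios are correct.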


Recurrent Hypernetworks are Surprisingly Strong in Meta-RL

Neural Information Processing Systems

Deep reinforcement learning (RL) is notoriously impractical to deploy due to sample inefficiency. Meta-RL directly addresses this sample inefficiency by learning to perform few-shot learning when a distribution of related tasks is available for meta-training. While many specialized meta-RL methods have been proposed, recent work suggests that end-to-end learning in conjunction with an off-the-shelf sequential model, such as a recurrent network, is a surprisingly strong baseline. However, such claims have been controversial due to limited supporting evidence, particularly in the face of prior work establishing precisely the opposite. In this paper, we conduct an empirical investigation. While we likewise find that a recurrent network can achieve strong performance, we demonstrate that the use of hypernetworks is crucial to maximizing their potential. Surprisingly, when combined with hypernetworks, the recurrent baselines that are far simpler than existing specialized methods actually achieve the strongest performance of all methods evaluated. We provide code at https://github.com/jacooba/hyper.
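To make the term concrete: a hypernetwork is a network whose output is the parameters of another network. The sketch below is a minimal, dependency-free illustration, not the authors' implementation (their code is at the URL above). It assumes a linear hypernetwork that maps a context vector, standing in for a recurrent network's hidden state, to the weights and biases of a linear policy head. All names and sizes are hypothetical.

```python
import random


def make_hypernetwork(context_dim, obs_dim, act_dim, seed=0):
    """Return a function mapping a context vector to policy-head parameters."""
    rng = random.Random(seed)
    n_out = act_dim * obs_dim + act_dim  # flattened weights followed by biases
    # Hypernetwork parameters: one row per generated policy parameter.
    hyper_weights = [
        [rng.gauss(0.0, 0.1) for _ in range(context_dim)] for _ in range(n_out)
    ]

    def generate(context):
        flat = [sum(h * c for h, c in zip(row, context)) for row in hyper_weights]
        weights = [flat[i * obs_dim:(i + 1) * obs_dim] for i in range(act_dim)]
        biases = flat[act_dim * obs_dim:]
        return weights, biases

    return generate


def policy_logits(obs, weights, biases):
    """Linear policy head whose parameters were produced by the hypernetwork."""
    return [sum(w * o for w, o in zip(row, obs)) + b for row, b in zip(weights, biases)]


generate = make_hypernetwork(context_dim=4, obs_dim=3, act_dim=2)
# The context would come from a recurrent encoder; here it is a fixed stand-in.
weights, biases = generate([0.5, -1.0, 0.25, 2.0])
logits = policy_logits([1.0, 0.0, -1.0], weights, biases)
```

Because the policy parameters are a function of the context, a different task history yields a different policy without any gradient step at test time.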


Improving Generalization in Meta-RL with Imaginary Tasks from Latent Dynamics Mixture

Neural Information Processing Systems

The generalization ability of most meta-reinforcement learning (meta-RL) methods is largely limited to test tasks that are sampled from the same distribution used to sample training tasks. To overcome the limitation, we propose Latent Dynamics Mixture (LDM) that trains a reinforcement learning agent with imaginary tasks generated from mixtures of learned latent dynamics. By training a policy on mixture tasks along with original training tasks, LDM allows the agent to prepare for unseen test tasks during training and prevents the agent from overfitting the training tasks. LDM significantly outperforms standard meta-RL methods in test returns on the gridworld navigation and MuJoCo tasks where we strictly separate the training task distribution and the test task distribution.
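As a rough illustration of generating an imaginary task from a mixture, the sketch below draws Dirichlet-distributed weights and forms a convex combination of per-task latent vectors. This is an assumption-laden simplification: the abstract does not specify how LDM samples weights or decodes mixed latents into dynamics, and the latent vectors and function names here are hypothetical.

```python
import random


def sample_mixture_weights(n_tasks, alpha=1.0, rng=random):
    """Sample Dirichlet(alpha, ..., alpha) weights via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_tasks)]
    total = sum(draws)
    return [d / total for d in draws]


def mix_latents(latents, weights):
    """Weighted sum of latent vectors, one per training task."""
    dim = len(latents[0])
    return [sum(w * z[i] for w, z in zip(weights, latents)) for i in range(dim)]


rng = random.Random(0)
# Hypothetical learned latents for three training tasks.
train_latents = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = sample_mixture_weights(len(train_latents), rng=rng)
imaginary_latent = mix_latents(train_latents, weights)
```

A decoder conditioned on `imaginary_latent` would then supply the dynamics or rewards of the imaginary task. That decoder is outside the scope of this sketch.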


Enhanced Meta Reinforcement Learning via Demonstrations in Sparse Reward Environments

Neural Information Processing Systems

Meta reinforcement learning (Meta-RL) is an approach wherein the experience gained from solving a variety of tasks is distilled into a meta-policy. The meta-policy, when adapted over only a small number of steps (or even a single step), is able to perform near-optimally on a new, related task. However, a major challenge in adopting this approach for real-world problems is that such problems are often associated with sparse reward functions that only indicate whether a task is completed partially or fully. We consider the situation where some data, possibly generated by a sub-optimal agent, is available for each task. We then develop a class of algorithms called Enhanced Meta-RL via Demonstrations (EMRLD) that exploit this information, even if sub-optimal, to obtain guidance during training. We show how EMRLD jointly utilizes RL and supervised learning over the offline data to generate a meta-policy that demonstrates monotone performance improvements. We also develop a warm-started variant called EMRLD-WS that is particularly efficient for sub-optimal demonstration data. Finally, we show that our EMRLD algorithms significantly outperform existing approaches in a variety of sparse reward environments, including that of a mobile robot.


Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation

arXiv.org Artificial Intelligence

The incorporation of memory into agents is essential for numerous tasks within the domain of Reinforcement Learning (RL). In particular, memory is paramount for tasks that require the utilization of past information, adaptation to novel environments, and improved sample efficiency. However, the term "memory" encompasses a wide range of concepts, which, coupled with the lack of a unified methodology for validating an agent's memory, leads to erroneous judgments about agents' memory capabilities and prevents objective comparison with other memory-enhanced agents. This paper aims to streamline the concept of memory in RL by providing practical, precise definitions of agent memory types, such as long-term versus short-term memory and declarative versus procedural memory, inspired by cognitive science. Using these definitions, we categorize different classes of agent memory, propose a robust experimental methodology for evaluating the memory capabilities of RL agents, and standardize evaluations. Furthermore, through experiments with different RL agents, we empirically demonstrate the importance of adhering to the proposed methodology when evaluating different types of agent memory, and show what violating it leads to.

Reinforcement Learning (RL) effectively addresses various problems within the Markov Decision Process (MDP) framework, where agents make decisions based on immediately available information (Mnih et al., 2015; Badia et al., 2020). However, there are still challenges in applying RL to more complex tasks with partial observability. To successfully address such challenges, it is essential that an agent is able to efficiently store and process the history of its interactions with the environment (Ni et al., 2021). Sequence processing methods originally developed for natural language processing (NLP) can be effectively applied to these tasks because the history of interactions with the environment can be represented as a sequence (Hausknecht & Stone, 2015; Esslinger et al., 2022; Samsami et al., 2024). However, in many tasks, due to the complexity or noisiness of observations, the sparsity of events, the difficulty of designing the reward function, and the long duration of episodes, storing and retrieving important information becomes extremely challenging, and the need for memory mechanisms arises (Graves et al., 2016; Wayne et al., 2018; Goyal et al., 2022).


Improved Robustness and Safety for Pre-Adaptation of Meta Reinforcement Learning with Prior Regularization

arXiv.org Artificial Intelligence

Meta Reinforcement Learning (Meta-RL) has seen substantial advancements recently. In particular, off-policy methods were developed to improve the data efficiency of Meta-RL techniques. Probabilistic embeddings for actor-critic RL (PEARL) is a leading approach for multi-MDP adaptation problems. A major drawback of many existing Meta-RL methods, including PEARL, is that they do not explicitly consider the safety of the prior policy when it is exposed to a new task for the first time. Safety is essential for many real-world applications, including field robots and Autonomous Vehicles (AVs). In this paper, we develop the PEARL PLUS (PEARL+) algorithm, which optimizes the policy for both prior (pre-adaptation) safety and posterior (after-adaptation) performance. Building on top of PEARL, our proposed PEARL+ algorithm introduces a prior regularization term in the reward function and a new Q-network for recovering the state-action value under prior context assumptions. Together these improve robustness to task distribution shift and the safety of the trained network when it is exposed to a new task for the first time. The performance of PEARL+ is validated by solving three safety-critical problems related to robots and AVs, including two MuJoCo benchmark problems. The simulation experiments show that, compared to PEARL, the safety of the prior policy is significantly improved and the policy is more robust to task distribution shift.