Reinforcement Learning
CATCH: Context-based Meta Reinforcement Learning for Transferrable Architecture Search
Chen, Xin, Duan, Yawen, Chen, Zewei, Xu, Hang, Chen, Zihao, Liang, Xiaodan, Zhang, Tong, Li, Zhenguo
Neural Architecture Search (NAS) achieved many breakthroughs in recent years. In spite of its remarkable progress, many algorithms are restricted to particular search spaces. They also lack efficient mechanisms to reuse knowledge when confronting multiple tasks. These challenges preclude their applicability, and motivate our proposal of CATCH, a novel Context-bAsed meTa reinforcement learning (RL) algorithm for transferrable arChitecture searcH. The combination of meta-learning and RL allows CATCH to efficiently adapt to new tasks while being agnostic to search spaces. CATCH utilizes a probabilistic encoder to encode task properties into latent context variables, which then guide CATCH's controller to quickly "catch" top-performing networks. The contexts also assist a network evaluator in filtering inferior candidates and speed up learning. Extensive experiments demonstrate CATCH's universality and search efficiency over many other widely-recognized algorithms. It is also capable of handling cross-domain architecture search as competitive networks on ImageNet, COCO, and Cityscapes are identified. This is the first work to our knowledge that proposes an efficient transferrable NAS solution while maintaining robustness across various settings.
Provably Good Batch Reinforcement Learning Without Great Exploration
Liu, Yao, Swaminathan, Adith, Agarwal, Alekh, Brunskill, Emma
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks. Doing batch RL in a way that yields a reliable new policy in large domains is challenging: a new decision policy may visit states and actions outside the support of the batch data, and function approximation and optimization with limited samples can further increase the potential of learning policies with overly optimistic estimates of their future performance. Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes. Theoretical work that provides strong guarantees on the performance of the output policy relies on a strong concentrability assumption, which makes it unsuitable for cases where the ratio between the state-action distributions of the behavior policy and some candidate policies is large. This is because in the traditional analysis, the error bound scales up with this ratio. We show that a small modification to the Bellman optimality and evaluation back-ups to take a more conservative update can yield much stronger guarantees. In certain settings, the resulting algorithms can find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability. We highlight the necessity of our conservative update and the limitations of previous algorithms and analyses by illustrative MDP examples, and present an empirical comparison of our algorithm with other state-of-the-art batch RL baselines on standard benchmarks.
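To make the idea of a conservative back-up concrete, the following is a minimal tabular sketch and not the authors' algorithm. It rests on assumptions that are not in the abstract: rewards are non-negative (so that 0 is a pessimistic value), "explored by the batch data" is approximated by a visit-count threshold, and the names conservative_q_iteration, greedy_supported_action, and min_count are hypothetical.

```python
def conservative_q_iteration(transitions, n_states, n_actions,
                             gamma=0.9, min_count=2, sweeps=200):
    """Tabular Q-iteration on batch data with a count-filtered back-up.

    transitions: list of (state, action, reward, next_state) tuples.
    Assumption (not from the abstract): rewards are >= 0, so replacing the
    value of a rarely seen pair by 0 can only lower the estimate.
    """
    counts = [[0] * n_actions for _ in range(n_states)]
    for s, a, _, _ in transitions:
        counts[s][a] += 1

    def supported(s, a):
        return counts[s][a] >= min_count

    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        sums = [[0.0] * n_actions for _ in range(n_states)]
        for s, a, r, s_next in transitions:
            # Under-supported next actions contribute 0 instead of their estimate.
            best_next = max(q[s_next][b] if supported(s_next, b) else 0.0
                            for b in range(n_actions))
            sums[s][a] += r + gamma * best_next
        # Empirical average of the back-up targets; unseen pairs stay at 0.
        q = [[sums[s][a] / counts[s][a] if counts[s][a] else 0.0
              for a in range(n_actions)] for s in range(n_states)]
    return q, counts


def greedy_supported_action(q, counts, state, min_count=2):
    """Greedy action restricted to pairs with enough data (None if there are none)."""
    candidates = [a for a in range(len(q[state])) if counts[state][a] >= min_count]
    return max(candidates, key=lambda a: q[state][a]) if candidates else None


# Toy batch: action 1 in state 1 was tried once and happened to pay 10.
batch = ([(0, 0, 1.0, 0)] * 3 + [(0, 1, 0.0, 1)] * 3 +
         [(1, 0, 0.0, 1)] * 3 + [(1, 1, 10.0, 1)])
q_cons, counts = conservative_q_iteration(batch, 2, 2, min_count=2)
q_naive, _ = conservative_q_iteration(batch, 2, 2, min_count=1)
print(greedy_supported_action(q_cons, counts, 0, min_count=2))
print(greedy_supported_action(q_naive, counts, 0, min_count=1))
```

With the filter switched off (min_count=1), the single lucky transition propagates through the back-up and pulls the greedy choice in state 0 toward the poorly explored region; with the filter on, the policy stays on the well-covered action.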
Batch Policy Learning in Average Reward Markov Decision Processes
Liao, Peng, Qi, Zhengling, Murphy, Susan
We study the problem of policy optimization in Markov Decision Processes over infinite time horizons (Puterman, 1994). We focus on the batch (i.e., off-line) setting, where historical data of multiple trajectories has been previously collected using some behavior policy. Our goal is to learn a new policy with guaranteed performance when implemented in the future. In this work, we develop a data-efficient method to learn the policy that optimizes the long-term average reward in a pre-specified policy class from a training set composed of multiple trajectories. Furthermore, we establish a finite-sample regret guarantee, i.e., a bound on the difference between the average reward of the optimal policy in the class and the average reward of the policy estimated by our proposed method. This work is motivated by the development of just-in-time adaptive interventions in mobile health (mHealth) applications (Nahum-Shani et al., 2017). Our method can be used to learn a treatment policy that maps the real-time collected information about the individual's status and context to a particular treatment at each of many decision times to support health behaviors.
Approximation Benefits of Policy Gradient Methods with Aggregated States
Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregation, where the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per-period is bounded by $\epsilon$, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as $\epsilon/(1-\gamma)$, where $\gamma$ is a discount factor. Theoretical results synthesize recent analysis of policy gradient methods with insights of Van Roy (2006) into the critical role of state-relevance weights in approximate dynamic programming.
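As a small illustration of the quantity $\epsilon$, the sketch below computes the largest gap between state-action values that share a partition cell, and the two per-period regret scales quoted in the abstract. The data layout (dictionaries keyed by state and action) and the choice to compare values only for the same action within a cell are assumptions made for this example; the function name aggregation_error is hypothetical.

```python
from collections import defaultdict


def aggregation_error(q_values, partition):
    """Largest gap between Q-values of states lumped into the same cell.

    q_values:  dict mapping (state, action) -> float
    partition: dict mapping state -> cell id
    Assumption: values are compared for the same action within a cell.
    """
    groups = defaultdict(list)
    for (state, action), q in q_values.items():
        groups[(partition[state], action)].append(q)
    return max(max(qs) - min(qs) for qs in groups.values())


# Toy example: states s0 and s1 are aggregated, s2 is on its own.
q_values = {
    ("s0", "a"): 1.0, ("s1", "a"): 1.3, ("s2", "a"): 0.2,
    ("s0", "b"): 0.5, ("s1", "b"): 0.4, ("s2", "b"): 0.9,
}
partition = {"s0": 0, "s1": 0, "s2": 1}

gamma = 0.9
eps = aggregation_error(q_values, partition)
print("epsilon:", eps)
print("policy gradient scale, epsilon:", eps)
print("approximate policy/value iteration scale, epsilon/(1-gamma):", eps / (1 - gamma))
```

For a discount factor close to 1, the second scale is larger by the factor 1/(1-gamma), which is the gap the abstract highlights.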
FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs
Agarwal, Alekh, Kakade, Sham, Krishnamurthy, Akshay, Sun, Wen
The ability to learn effective transformations of complex data sources, sometimes called representation learning, is an essential primitive in modern machine learning, leading to remarkable achievements in language modeling, vision, and serving as a partial explanation for the success of deep learning more broadly (Bengio et al., 2013). In Reinforcement Learning (RL), several works have shown empirically that learning succinct representations of perceptual inputs can accelerate the search for decision-making policies (Pathak et al., 2017; Tang et al., 2017; Oord et al., 2018; Srinivas et al., 2020). However, representation learning for RL is far more subtle than it is for supervised learning (Du et al., 2019a; Van Roy and Dong, 2019; Lattimore and Szepesvari, 2019), and the theoretical foundations of representation learning for RL are nascent. The first question that arises in this context is: what is a good representation? Intuitively, a good representation should help us achieve greater sample efficiency on downstream tasks.
DeepMind's AI automatically generates reinforcement learning algorithms
In a study published on the preprint server Arxiv.org, DeepMind researchers describe a reinforcement learning algorithm-generating technique that discovers what to predict and how to learn it by interacting with environments. They claim the generated algorithms perform well on a range of challenging Atari video games, achieving "non-trivial" performance indicative of the technique's generalizability. Reinforcement learning algorithms -- algorithms that enable software agents to learn in environments by trial and error using feedback -- update an agent's parameters according to one of several rules. These rules are usually discovered through years of research, and automating their discovery from data could lead to more efficient algorithms, or algorithms better adapted to specific environments. DeepMind's solution is a meta-learning framework that jointly discovers what a particular agent should predict and how to use the predictions for policy improvement.
How Reinforcement Learning Can Help In Data Valuation
It is well established that machine learning models perform better with well-curated, large-scale data. However, collecting and curating data is one of the biggest challenges right now. There are billion-dollar companies like Scale.ai that set up shop with the sole purpose of annotating data. The whole data collection process is so tedious that it has become profitable for a few. But we are still talking about what happens before data arrives at an ML pipeline.
SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning
Lee, Kimin, Laskin, Michael, Srinivas, Aravind, Abbeel, Pieter
Model-free deep reinforcement learning (RL) has been successful in a range of challenging domains. However, there are some remaining issues, such as stabilizing the optimization of nonlinear function approximators, preventing error propagation due to the Bellman backup in Q-learning, and efficient exploration. To mitigate these issues, we present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy RL algorithms. SUNRISE integrates three key ingredients: (a) bootstrap with random initialization which improves the stability of the learning process by training a diverse ensemble of agents, (b) weighted Bellman backups, which prevent error propagation in Q-learning by reweighing sample transitions based on uncertainty estimates from the ensembles, and (c) an inference method that selects actions using highest upper-confidence bounds for efficient exploration. Our experiments show that SUNRISE significantly improves the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments. Our training code is available at https://github.com/pokaxpoka/sunrise.
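Ingredient (c) can be sketched in a few lines. The example below scores each action by the ensemble mean plus a multiple of the ensemble standard deviation and picks the highest score. This is a minimal illustration under assumptions: the list-of-lists layout, the function name ucb_action, and the parameter ucb_lambda are hypothetical and are not taken from the SUNRISE repository.

```python
from statistics import mean, pstdev


def ucb_action(ensemble_q, ucb_lambda=1.0):
    """Pick the action with the highest upper-confidence bound.

    ensemble_q: list with one entry per ensemble member; each entry is a
                list of Q-value estimates, one per action, for one state.
    ucb_lambda: hypothetical exploration weight on the ensemble disagreement.
    """
    n_actions = len(ensemble_q[0])
    scores = []
    for a in range(n_actions):
        member_values = [q[a] for q in ensemble_q]
        # Mean estimate plus a bonus for disagreement among ensemble members.
        scores.append(mean(member_values) + ucb_lambda * pstdev(member_values))
    return max(range(n_actions), key=scores.__getitem__)


# Three ensemble members, two actions. Members agree on action 0 and
# disagree on action 1, so action 1 receives an exploration bonus.
ensemble_q = [[1.0, 0.5], [1.0, 1.3], [1.0, 0.6]]
print(ucb_action(ensemble_q, ucb_lambda=0.0))  # purely greedy on the mean
print(ucb_action(ensemble_q, ucb_lambda=1.0))  # optimistic under uncertainty
```

Setting ucb_lambda to 0 recovers greedy selection on the ensemble mean; larger values favor actions on which the ensemble members disagree.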
Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping
Zhou, Dongruo, He, Jiafan, Gu, Quanquan
Designing efficient algorithms that learn and plan in sequential decision-making tasks with large state and action spaces has become the central goal of modern reinforcement learning (RL) in recent years. Due to the large number of possible states and actions, traditional tabular reinforcement learning methods (Watkins, 1989; Jaksch et al., 2010; Azar et al., 2017), which directly access each state-action pair, are computationally intractable. A common method to design reinforcement learning algorithms for large-scale state and action spaces is to make use of feature mappings such as linear functions or neural networks to map states and actions to a low-dimensional space and solve the decision-making problem in the feature space. Despite the empirical success of feature mapping based reinforcement learning methods (Singh et al., 1995; Kwok and Fox, 2004; Bertsekas, 2018), the theoretical understanding and the fundamental limits of these methods remain largely understudied. In this paper, we aim to develop provable reinforcement learning algorithms with feature mapping for discounted Markov Decision Processes (MDPs). The discounted MDP is one of the most widely used models to formulate modern reinforcement learning tasks such as Atari games (Mnih et al., 2015) and deep recommendation systems (Zheng et al., 2018).
Integrating Deep Reinforcement Learning Networks with Health System Simulations
Background and motivation: Combining Deep Reinforcement Learning (Deep RL) and Health Systems Simulations has significant potential, both for research into improving Deep RL performance and safety, and in operational practice. While individual toolkits exist for Deep RL and Health Systems Simulations, no framework to integrate the two has been established. Aim: To provide a framework for integrating Deep RL Networks with Health System Simulations, and to ensure this framework is compatible with Deep RL agents that have been developed and tested using OpenAI Gym. Methods: We developed our framework based on the OpenAI Gym framework, and demonstrate its use on a simple hospital bed capacity model. We built the Deep RL agents using PyTorch, and the Hospital Simulation using SimPy. Results: We demonstrate example models using a Double Deep Q Network or a Duelling Double Deep Q Network as the Deep RL agent. Conclusion: SimPy may be used to create Health System Simulations that are compatible with agents developed and tested on OpenAI Gym environments. GitHub repository of code: https://github.com/MichaelAllen1966/learninghospital
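The sketch below shows what compatibility with the classic OpenAI Gym interface amounts to: an environment that exposes reset() and a step(action) method returning an observation, a reward, a done flag, and an info dictionary. It is a hypothetical toy written with the Python standard library only. It does not use SimPy or Gym, it is not the model from the linked repository, and the class name, dynamics, and reward are invented for illustration.

```python
import random


class ToyBedCapacityEnv:
    """Hypothetical toy bed-capacity model with a classic Gym-style interface."""

    def __init__(self, max_steps=30, arrival_rate=3, seed=0):
        self.max_steps = max_steps
        self.arrival_rate = arrival_rate
        self.rng = random.Random(seed)
        self.actions = (-2, 0, 2)  # change in the number of staffed beds

    def reset(self):
        self.beds = 10
        self.patients = 8
        self.t = 0
        return (self.beds, self.patients)

    def step(self, action):
        self.beds = max(0, self.beds + self.actions[action])
        arrivals = self.rng.randint(0, 2 * self.arrival_rate)
        discharges = self.rng.randint(0, min(self.patients, 2 * self.arrival_rate))
        self.patients = self.patients + arrivals - discharges
        # Penalise both empty beds and patients without a bed.
        reward = -abs(self.beds - self.patients)
        self.t += 1
        done = self.t >= self.max_steps
        return (self.beds, self.patients), reward, done, {}


def run_episode(env, policy):
    """Standard agent-environment loop; works for any env with this interface."""
    obs, total, done = env.reset(), 0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total


def match_demand(obs):
    """Simple hand-written baseline: add beds when short, remove when idle."""
    beds, patients = obs
    return 2 if patients > beds else (0 if patients < beds else 1)


print(run_episode(ToyBedCapacityEnv(seed=1), match_demand))
```

Because the loop in run_episode only relies on reset() and step(), a learned agent such as a Double Deep Q Network could replace the hand-written policy without changes to the environment.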