Reinforcement Learning
Logical Team Q-learning: An approach towards factored policies in cooperative MARL
We address the challenge of learning factored policies in cooperative MARL scenarios. In particular, we consider the situation in which a team of agents collaborates to optimize a common cost. Our goal is to obtain factored policies that determine the individual behavior of each agent so that the resulting joint policy is optimal. In this work we make contributions to both the dynamic programming and reinforcement learning settings. In the dynamic programming case we provide a number of lemmas that prove the existence of such factored policies and we introduce an algorithm (along with proof of convergence) that provably leads to them. Then we introduce tabular and deep versions of Logical Team Q-learning, which is a stochastic version of the algorithm for the RL case. We conclude the paper by providing experiments that illustrate the claims.
Rapid Task-Solving in Novel Environments
Ritter, Sam, Faulkner, Ryan, Sartran, Laurent, Santoro, Adam, Botvinick, Matt, Raposo, David
When thrust into an unfamiliar environment and charged with solving a series of tasks, an effective agent should (1) leverage prior knowledge to solve its current task while (2) efficiently exploring to gather knowledge for use in future tasks, and then (3) plan using that knowledge when faced with new tasks in that same environment. We introduce two domains for conducting research on this challenge, and find that state-of-the-art deep reinforcement learning (RL) agents fail to plan in novel environments. We develop a recursive implicit planning module that operates over episodic memories, and show that the resulting deep-RL agent is able to explore and plan in novel environments, outperforming the nearest baseline by factors of 2-3 across the two domains. We find evidence that our module (1) learned to execute a sensible information-propagating algorithm and (2) generalizes to situations beyond its training experience.
Curiosity Killed the Cat and the Asymptotically Optimal Agent
Cohen, Michael K., Hutter, Marcus
Reinforcement learners are agents that learn to pick actions that lead to high reward. Ideally, the value of a reinforcement learner's policy approaches optimality--where the optimal informed policy is the one which maximizes reward. Unfortunately, we show that if an agent is guaranteed to be "asymptotically optimal" in any (stochastically computable) environment, then subject to an assumption about the true environment, this agent will be either destroyed or incapacitated with probability 1; both of these are forms of traps as understood in the Markov Decision Process literature. Environments with traps pose a well-known problem for agents, but we are unaware of other work which shows that traps are not only a risk, but a certainty, for agents of a certain caliber. Much work in reinforcement learning uses an ergodicity assumption to avoid this problem. Often, doing theoretical research under simplifying assumptions prepares us to provide practical solutions even in the absence of those assumptions, but the ergodicity assumption in reinforcement learning may have led us entirely astray in preparing safe and effective exploration strategies for agents in dangerous environments. Rather than assuming away the problem, we present an agent with the modest guarantee of approaching the performance of a mentor, doing safe exploration instead of reckless exploration.
Wasserstein Distance guided Adversarial Imitation Learning with Reward Shape Exploration
Zhang, Ming, Wang, Yawei, Ma, Xiaoteng, Xia, Li, Yang, Jun, Li, Zhiheng, Li, Xiu
The generative adversarial imitation learning (GAIL) has provided an adversarial learning framework for imitating expert policy from demonstrations in high-dimensional continuous tasks. However, almost all GAIL and its extensions only design a kind of reward function of logarithmic form in the adversarial training strategy with the Jensen-Shannon (JS) divergence for all complex environments. The fixed logarithmic type of reward function may be difficult to solve all complex tasks, and the vanishing gradients problem caused by the JS divergence will harm the adversarial learning process. In this paper, we propose a new algorithm named Wasserstein Distance guided Adversarial Imitation Learning (WDAIL) for promoting the performance of imitation learning (IL). There are three improvements in our method: (a) introducing the Wasserstein distance to obtain more appropriate measure in adversarial training process, (b) using proximal policy optimization (PPO) in the reinforcement learning stage which is much simpler to implement and makes the algorithm more efficient, and (c) exploring different reward function shapes to suit different tasks for improving the performance. The experiment results show that the learning procedure remains remarkably stable, and achieves significant performance in the complex continuous control tasks of MuJoCo.
DTR Bandit: Learning to Make Response-Adaptive Decisions With Low Regret
Dynamic treatment regimes (DTRs) are personalized, adaptive, multi-stage treatment plans that adapt treatment decisions both to an individual's initial features and to intermediate outcomes and features at each subsequent stage, which are affected by decisions in prior stages. Examples include personalized first- and second-line treatments of chronic conditions like diabetes, cancer, and depression, which adapt to patient response to first-line treatment, disease progression, and individual characteristics. While existing literature mostly focuses on estimating the optimal DTR from offline data such as from sequentially randomized trials, we study the problem of developing the optimal DTR in an online manner, where the interaction with each individual affect both our cumulative reward and our data collection for future learning. We term this the DTR bandit problem. We propose a novel algorithm that, by carefully balancing exploration and exploitation, is guaranteed to achieve rate-optimal regret when the transition and reward models are linear. We demonstrate our algorithm and its benefits both in synthetic experiments and in a case study of adaptive treatment of major depressive disorder using real-world data.
Advanced AI: Deep Reinforcement Learning in Python
Online Courses Udemy Advanced AI: Deep Reinforcement Learning in Python, The Complete Guide to Mastering Artificial Intelligence using Deep Learning and Neural Networks Created by Lazy Programmer Team, Lazy Programmer Inc. English [Auto-generated], Indonesian [Auto-generated], 5 more Students also bought Deep Learning: Convolutional Neural Networks in Python Deep Learning: Recurrent Neural Networks in Python Unsupervised Machine Learning Hidden Markov Models in Python Bayesian Machine Learning in Python: A/B Testing Data Science: Supervised Machine Learning in Python Preview this course GET COUPON CODE Description This course is all about the application of deep learning and neural networks to reinforcement learning. If you've taken my first reinforcement learning class, then you know that reinforcement learning is on the bleeding edge of what we can do with AI. Specifically, the combination of deep learning with reinforcement learning has led to AlphaGo beating a world champion in the strategy game Go, it has led to self-driving cars, and it has led to machines that can play video games at a superhuman level. Reinforcement learning has been around since the 70s but none of this has been possible until now. The world is changing at a very fast pace.
Constrained Reinforcement Learning for Dynamic Optimization under Uncertainty
Petsagkourakis, Panagiotis, Sandoval, Ilya Orson, Bradford, Eric, Zhang, Dongda, Chanona, Ehecatl Antonio del Rรญo
Dynamic real-time optimization (DRTO) is a challenging task due to the fact that optimal operating conditions must be computed in real time. The main bottleneck in the industrial application of DRTO is the presence of uncertainty. Many stochastic systems present the following obstacles: 1) plant-model mismatch, 2) process disturbances, 3) risks in violation of process constraints. To accommodate these difficulties, we present a constrained reinforcement learning (RL) based approach. RL naturally handles the process uncertainty by computing an optimal feedback policy. However, no state constraints can be introduced intuitively. To address this problem, we present a chance-constrained RL methodology. We use chance constraints to guarantee the probabilistic satisfaction of process constraints, which is accomplished by introducing backoffs, such that the optimal policy and backoffs are computed simultaneously. Backoffs are adjusted using the empirical cumulative distribution function to guarantee the satisfaction of a joint chance constraint. The advantage and performance of this strategy are illustrated through a stochastic dynamic bioprocess optimization problem, to produce sustainable high-value bioproducts.
Spatial Action Maps for Mobile Manipulation
Wu, Jimmy, Sun, Xingyuan, Zeng, Andy, Song, Shuran, Lee, Johnny, Rusinkiewicz, Szymon, Funkhouser, Thomas
Typical end-to-end formulations for learning robotic navigation involve predicting a small set of steering command actions (e.g., step forward, turn left, turn right, etc.) from images of the current state (e.g., a bird's-eye view of a SLAM reconstruction). Instead, we show that it can be advantageous to learn with dense action representations defined in the same domain as the state. In this work, we present "spatial action maps," in which the set of possible actions is represented by a pixel map (aligned with the input image of the current state), where each pixel represents a local navigational endpoint at the corresponding scene location. Using ConvNets to infer spatial action maps from state images, action predictions are thereby spatially anchored on local visual features in the scene, enabling significantly faster learning of complex behaviors for mobile manipulation tasks with reinforcement learning. In our experiments, we task a robot with pushing objects to a goal location, and find that policies learned with spatial action maps achieve much better performance than traditional alternatives.
Solving Hard AI Planning Instances Using Curriculum-Driven Deep Reinforcement Learning
Feng, Dieqiao, Gomes, Carla P., Selman, Bart
Despite significant progress in general AI planning, certain domains remain out of reach of current AI planning systems. Sokoban is a PSPACE-complete planning task and represents one of the hardest domains for current AI planners. Even domain-specific specialized search methods fail quickly due to the exponential search complexity on hard instances. Our approach based on deep reinforcement learning augmented with a curriculum-driven method is the first one to solve hard instances within one day of training while other modern solvers cannot solve these instances within any reasonable time limit. In contrast to prior efforts, which use carefully handcrafted pruning techniques, our approach automatically uncovers domain structure. Our results reveal that deep RL provides a promising framework for solving previously unsolved AI planning problems, provided a proper training curriculum can be devised.
The growth and form of knowledge networks by kinesthetic curiosity
Zhou, Dale, Lydon-Staley, David M., Zurn, Perry, Bassett, Danielle S.
Throughout life, we might seek a calling, companions, skills, entertainment, truth, self-knowledge, beauty, and edification. The practice of curiosity can be viewed as an extended and open-ended search for valuable information with hidden identity and location in a complex space of interconnected information. Despite its importance, curiosity has been challenging to computationally model because the practice of curiosity often flourishes without specific goals, external reward, or immediate feedback. Here, we show how network science, statistical physics, and philosophy can be integrated into an approach that coheres with and expands the psychological taxonomies of specific-diversive and perceptual-epistemic curiosity. Using this interdisciplinary approach, we distill functional modes of curious information seeking as searching movements in information space. The kinesthetic model of curiosity offers a vibrant counterpart to the deliberative predictions of model-based reinforcement learning. In doing so, this model unearths new computational opportunities for identifying what makes curiosity curious.