Reinforcement Learning
Finite-Sample Analysis of Decentralized Temporal-Difference Learning with Linear Function Approximation
Sun, Jun, Wang, Gang, Giannakis, Georgios B., Yang, Qinmin, Yang, Zaiyue
Thanks to its generality, RL has been widely studied in many areas, such as control theory, game theory, operations research, multi-agent systems, machine learning, artificial intelligence, and statistics [23]. In recent years, combining with deep learning, RL has demonstrated its great potential in addressing challenging practical control and optimization problems [17, 21]. Among all possible algorithms, the temporal difference (TD) learning has arguably become one of the most popular RL algorithms so far, which is further dominated by the celebrated TD(0) algorithm [22]. TD learning provides an iterative process to update an estimate of the so-termed value function v π(s) with respect to a given policy π based on temporally successive samples. Dealing with a finite state space, the classical version of the TD(0) algorithm adopts a tabular representation for v π(s), which stores entry-wise value estimates on a per state basis. J. Sun and Q. Yang are with the College of Control Science and Engineering, and the State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China. G. Wang and G. B. Giannakis are with the Digital Technology Center and the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA. Z. Yang is with the Department of Mechanical and Energy Engineering, Southern University of Science and Technology, Shenzhen, China.
Non-Cooperative Inverse Reinforcement Learning
Zhang, Xiangyuan, Zhang, Kaiqing, Miehling, Erik, Başar, Tamer
Making decisions in the presence of a strategic opponent requires one to take into account the opponent's ability to actively mask its intended objective. To describe such strategic situations, we introduce the non-cooperative inverse reinforcement learning (N-CIRL) formalism. The N-CIRL formalism consists of two agents with completely misaligned objectives, where only one of the agents knows the true objective function. Formally, we model the N-CIRL formalism as a zero-sum Markov game with one-sided incomplete information. Through interacting with the more informed player, the less informed player attempts to both infer, and act according to, the true objective function. As a result of the one-sided incomplete information, the multi-stage game can be decomposed into a sequence of single-stage games expressed by a recursive formula. Solving this recursive formula yields the value of the N-CIRL game and the more informed player's equilibrium strategy. Another recursive formula, constructed by forming an auxiliary game, termed the dual game, yields the less informed player's strategy. Building upon these two recursive formulas, we develop a computationally tractable algorithm to approximately solve for the equilibrium strategies. Finally, we demonstrate the benefits of our N-CIRL formalism over the existing multi-agent IRL formalism via extensive numerical simulation in a novel cyber security setting.
Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs
Zanette, Andrea, Brunskill, Emma
In order to make good decision under uncertainty an agent must learn from observations. To do so, two of the most common frameworks are Contextual Bandits and Markov Decision Processes (MDPs). In this paper, we study whether there exist algorithms for the more general framework (MDP) which automatically provide the best performance bounds for the specific problem at hand without user intervention and without modifying the algorithm. In particular, it is found that a very minor variant of a recently proposed reinforcement learning algorithm for MDPs already matches the best possible regret bound $\tilde O (\sqrt{SAT})$ in the dominant term if deployed on a tabular Contextual Bandit problem despite the agent being agnostic to such setting.
Maximum Entropy Diverse Exploration: Disentangling Maximum Entropy Reinforcement Learning
Cohen, Andrew, Yu, Lei, Qiao, Xingye, Tong, Xiangrong
Two hitherto disconnected threads of research, diverse exploration (DE) and maximum entropy RL have addressed a wide range of problems facing reinforcement learning algorithms via ostensibly distinct mechanisms. In this work, we identify a connection between these two approaches. First, a discriminator-based diversity objective is put forward and connected to commonly used divergence measures. We then extend this objective to the maximum entropy framework and propose an algorithm Maximum Entropy Diverse Exploration (MEDE) which provides a principled method to learn diverse behaviors. A theoretical investigation shows that the set of policies learned by MEDE capture the same modalities as the optimal maximum entropy policy. In effect, the proposed algorithm disentangles the maximum entropy policy into its diverse, constituent policies. Experiments show that MEDE is superior to the state of the art in learning high performing and diverse policies.
On Solving the 2-Dimensional Greedy Shooter Problem for UAVs
Anderson, Loren, Senapathy, Sahitya
Unmanned Aerial Vehicles (UAVs), autonomously-guided aircraft, are widely used for tasks involving surveillance and reconnaissance. A version of the pursuit-evasion problems centered around UAVs and its variants has been extensively studied in recent years due to numerous breakthroughs in AI. We present an approach to UAV pursuit-evasion in a 2D aerial-engagement environment using reinforcement learning (RL), a machine learning paradigm concerned with goal-oriented algorithms. In this work, a UAV wielding the greedy shooter strategy engages with a UAV trained using deep Q-learning techniques. Simulated results show that the latter UAV wins every engagement in which the UAVs are suffciently separated during initialization. This approach highlights an exhaustive and robust application of reinforcement learning to pursuit-evasion that provides insight into effective strategies for UAV flight and interaction.
Learning from Trajectories via Subgoal Discovery
Paul, Sujoy, van Baar, Jeroen, Roy-Chowdhury, Amit K.
Learning to solve complex goal-oriented tasks with sparse terminal-only rewards often requires an enormous number of samples. In such cases, using a set of expert trajectories could help to learn faster. However, Imitation Learning (IL) via supervised pre-training with these trajectories may not perform as well and generally requires additional finetuning with expert-in-the-loop. In this paper, we propose an approach which uses the expert trajectories and learns to decompose the complex main task into smaller sub-goals. We learn a function which partitions the state-space into sub-goals, which can then be used to design an extrinsic reward function. We follow a strategy where the agent first learns from the trajectories using IL and then switches to Reinforcement Learning (RL) using the identified sub-goals, to alleviate the errors in the IL step. To deal with states which are under-represented by the trajectory set, we also learn a function to modulate the sub-goal predictions. We show that our method is able to solve complex goal-oriented tasks, which other RL, IL or their combinations in literature are not able to solve.
Reinforcement Learning Algorithms with Python: Learn, understand, and develop smart algorithms for addressing AI challenges: Andrea Lonza: 9781789131116: Amazon.com: Books
Andrea Lonza is a deep learning engineer with a great passion for artificial intelligence and a desire to create machines that act intelligently. He has acquired expert knowledge in reinforcement learning, natural language processing, and computer vision through academic and industrial machine learning projects. He has also participated in several Kaggle competitions, achieving high results. He is always looking for compelling challenges and loves to prove himself.
AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning
Exploration is another key challenge in complex environments such as StarCraft. There are up to 1026 possible actions available to one of our agents at each time step, and the agent must make thousands of actions before learning if it has won or lost the game. Finding winning strategies is challenging in such a massive solution space. Even with a strong self-play system and a diverse league of main and exploiter agents, there would be almost no chance of a system developing successful strategies in such a complex environment without some prior knowledge. Learning human strategies, and ensuring that the agents keep exploring those strategies throughout self-play, was key to unlocking AlphaStar's performance.
Frequentist Regret Bounds for Randomized Least-Squares Value Iteration
Zanette, Andrea, Brandfonbrener, David, Pirotta, Matteo, Lazaric, Alessandro
A key challenge in reinforcement learning (RL) is how to bala nce exploration and exploitation in order to efficiently learn to make good sequences of decisions in a way that is both computationally tractable and statistically efficient. In the tabular case, the exploration-exploitation problem is well-understood for a number of settings (e.g., finite-hori zon, average reward, infinite horizon with discount), explorati on objectives (e.g., regret minimization and probably approximately correct), and for different algorithmic appro aches, where optimism-under-uncertainty [JOA10, FPLO18] and Thompson sampling (TS) [OBPVR16, Rus19] are the most pop ular principles. For instance, in the finite-horizon setting, [AOM17] and [ZB19] recently derived minimax optim al and structure adaptive regret bounds for optimistic exploration algorithms. TSbased algorithms have mainly b een analyzed in tabular MDPs in terms of Bayesian regret [OBPVR16, OR17, OGNJ17], which assumes that the MDP is s ampled from a known prior distribution. These bounds do not hold against a fixed MDP and algorithms with smal l Bayesian regret may still suffer high regret in some hard-to-learn MDPs within the chosen prior. In the tabu lar setting, frequentist (or worst-case) regret analysis h as been developed for TSbased algorithms both in the average r eward [GM15, AJ17] and finite-horizon case [Rus19]. Despite the fact that TSbased approaches have slightly wor se regret bounds compared to optimism-based algorithms, their empirical performance is often superior [CL11, OR17] . Unfortunately, the performance of tabular exploration met hods rapidly degrades with the number of states and actions, thus making them unfeasible in large or continuous MD Ps.