Reinforcement Learning
Towards Playing Full MOBA Games with Deep Reinforcement Learning
Ye, Deheng, Chen, Guibin, Zhang, Wen, Chen, Sheng, Yuan, Bo, Liu, Bo, Chen, Jia, Liu, Zhao, Qiu, Fuhao, Yu, Hongsheng, Yin, Yinyuting, Shi, Bei, Wang, Liang, Shi, Tengfei, Fu, Qiang, Yang, Wei, Huang, Lanxiao, Liu, Wei
MOBA games, e.g., Honor of Kings, League of Legends, and Dota 2, pose grand challenges to AI systems, such as multi-agent interaction, enormous state-action spaces, and complex action control. Developing AI for playing MOBA games has accordingly attracted much attention. However, existing work falls short in handling the raw game complexity caused by the explosion of agent combinations, i.e., lineups, when expanding the hero pool; for instance, OpenAI's Dota AI limits play to a pool of only 17 heroes. As a result, full MOBA games without restrictions are far from being mastered by any existing AI system. In this paper, we propose a MOBA AI learning paradigm that methodologically enables playing full MOBA games with deep reinforcement learning. Specifically, we develop a combination of novel and existing learning techniques, including curriculum self-play learning, policy distillation, off-policy adaption, multi-head value estimation, and Monte-Carlo tree-search, for training and playing a large pool of heroes while addressing the scalability issue. Tested on Honor of Kings, a popular MOBA game, we show how to build superhuman AI agents that can defeat top esports players. The superiority of our AI is demonstrated by the first large-scale performance test of MOBA AI agents in the literature.
Simple and optimal methods for stochastic variational inequalities, II: Markovian noise and policy evaluation in reinforcement learning
Kotsalis, Georgios, Lan, Guanghui, Li, Tianjiao
The focus of this paper is on stochastic variational inequalities (VI) under Markovian noise. A prominent application of our algorithmic developments is the stochastic policy evaluation problem in reinforcement learning. Prior investigations in the literature focused on temporal difference (TD) learning by employing nonsmooth finite-time analysis motivated by stochastic subgradient descent, which leads to certain limitations. These include the need to analyze a modified TD algorithm that involves projection onto an a priori defined Euclidean ball, a non-optimal convergence rate, and no clear way of deriving the beneficial effects of parallel implementation. Our approach remedies these shortcomings in the broader context of stochastic VIs and in particular for stochastic policy evaluation. We develop a variety of simple TD-learning-type algorithms that are motivated by the original TD algorithm and maintain its simplicity, while offering distinct advantages from a non-asymptotic analysis point of view. We first provide an improved analysis of the standard TD algorithm that can benefit from parallel implementation. Then we present versions of a conditional TD algorithm (CTD) that involves periodic updates of the stochastic iterates, which reduce the bias and therefore yield improved iteration complexity. This brings us to the fast TD (FTD) algorithm, which combines elements of CTD and the stochastic operator extrapolation method of the companion paper. With a novel index-resetting policy, FTD exhibits the best known convergence rate. We also devise a robust version of the algorithm that is particularly suitable for discount factors close to 1.
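For readers unfamiliar with the baseline that this abstract builds on, the following is a minimal sketch of classic tabular TD(0) policy evaluation along a single Markov trajectory. It is not the CTD or FTD algorithm from the paper. The two-state chain, the step size `alpha`, and the function name `td0_policy_evaluation` are assumptions chosen purely for illustration.

```python
import random


def td0_policy_evaluation(P, r, gamma, alpha, num_steps, seed=0):
    """Classic tabular TD(0) for a fixed policy (illustrative sketch).

    P[s] is a list of (next_state, probability) pairs under the policy,
    r[s] is the one-step reward received in state s.
    Samples come from one trajectory, so consecutive samples are
    correlated (Markovian noise) rather than i.i.d.
    """
    rng = random.Random(seed)
    V = [0.0] * len(P)
    s = 0
    for _ in range(num_steps):
        next_states, probs = zip(*P[s])
        s_next = rng.choices(next_states, weights=probs)[0]
        # TD error: one-step bootstrapped target minus current estimate.
        delta = r[s] + gamma * V[s_next] - V[s]
        V[s] += alpha * delta
        s = s_next
    return V


# Hypothetical two-state chain: from either state, move to state 0 or 1
# with probability 0.5 each; reward 1 in state 0, reward 0 in state 1.
P = [[(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)]]
r = [1.0, 0.0]
V = td0_policy_evaluation(P, r, gamma=0.9, alpha=0.01, num_steps=50_000)
print(V)
```

Solving the Bellman equation V = r + gamma * P V by hand for this chain gives V = (5.5, 4.5), so the printed estimates should lie close to those values, up to the noise induced by the constant step size.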
Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning
Wu, Jingfeng, Braverman, Vladimir, Yang, Lin F.
In single-objective reinforcement learning (RL), a scalar reward is pre-specified and an agent learns a policy to maximize the long-term cumulative reward [Azar et al., 2017, Jin et al., 2018]. However, in many real-world applications, we need to optimize multiple objectives for the same (unknown) environment, even when these objectives possibly conflict [Roijers et al., 2013]. For example, in an autonomous driving application, each passenger may have a different preference for driving style: some passengers prefer a very steady riding experience, while others enjoy the fast acceleration of the car. Therefore, the traditional single-objective RL approach may not be applicable in such scenarios. One way to tackle this issue is multi-objective reinforcement learning (MORL) [Roijers et al., 2013, Yang et al., 2019, Natarajan and Tadepalli, 2005, Abels et al., 2018], which models the multiple objectives by a vectorized reward and uses an additional preference vector to specify the relative importance of each objective. The MORL agent needs to find policies that optimize the cumulative preference-weighted rewards under all possible preferences.
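The preference-weighted reward described above can be sketched in a few lines. This is only an illustration of linear scalarization, not the algorithm from the paper. The two objectives (comfort, speed), the reward numbers, and the helper names are hypothetical.

```python
def scalarize(reward_vector, preference):
    """Preference-weighted scalar reward: inner product of the two vectors."""
    return sum(w * r for w, r in zip(preference, reward_vector))


def preference_weighted_return(trajectory, preference, gamma=1.0):
    """Discounted sum of scalarized rewards along one trajectory."""
    total, discount = 0.0, 1.0
    for reward_vector in trajectory:
        total += discount * scalarize(reward_vector, preference)
        discount *= gamma
    return total


def best_policy(candidates, preference, gamma=1.0):
    """Name of the candidate whose trajectory scores highest for a preference."""
    return max(
        candidates,
        key=lambda name: preference_weighted_return(
            candidates[name], preference, gamma
        ),
    )


# Hypothetical per-step reward vectors (comfort, speed) for two driving styles.
candidates = {
    "steady": [(1.0, 0.2), (1.0, 0.2)],
    "sporty": [(0.1, 1.0), (0.1, 1.0)],
}

print(best_policy(candidates, preference=(1.0, 0.0)))  # passenger who values comfort
print(best_policy(candidates, preference=(0.0, 1.0)))  # passenger who values speed
```

The same pair of candidate behaviours is ranked differently depending on the preference vector, which is why a MORL agent has to perform well across all preferences rather than for one fixed scalar reward.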
Enhanced Scene Specificity with Sparse Dynamic Value Estimation
Multi-scene reinforcement learning involves training the RL agent across multiple scenes or levels from the same task, and has become essential for many generalization applications. However, the inclusion of multiple scenes leads to an increase in sample variance for policy gradient computations, often resulting in suboptimal performance with the direct application of traditional methods (e.g., PPO, A3C). One strategy for variance reduction is to consider each scene as a distinct Markov decision process (MDP) and learn a joint value function dependent on both state (s) and MDP (M). However, this is non-trivial, as the agent is usually unaware of the underlying level at training and test time in multi-scene RL. Recently, Singh et al. [1] tried to address this by proposing a dynamic value estimation approach that models the true joint value function distribution as a Gaussian mixture model (GMM). In this paper, we argue that the error between the true scene-specific value function and the predicted dynamic estimate can be further reduced by progressively enforcing sparse cluster assignments once the agent has explored most of the state space. The resulting agents not only show significant improvements in the final reward score across a range of OpenAI ProcGen environments, but also exhibit increased navigation efficiency while completing a game level.
Advanced AI: Deep Reinforcement Learning in Python
This course is all about the application of deep learning and neural networks to reinforcement learning. If you've taken my first reinforcement learning class, then you know that reinforcement learning is on the bleeding edge of what we can do with AI. Specifically, the combination of deep learning with reinforcement learning has led to AlphaGo beating a world champion in the strategy game Go, it has led to self-driving cars, and it has led to machines that can play video games at a superhuman level. Reinforcement learning has been around since the 70s but none of this has been possible until now. The world is changing at a very fast pace.
Machine Learning A-Z: Hands-On Python & R In Data Science
Learn to create Machine Learning Algorithms in Python and R from two Data Science experts. Code templates included. Created by Kirill Eremenko, Hadelin de Ponteves, SuperDataScience Team, and SuperDataScience Support. What you will learn: master Machine Learning in Python and R; have a great intuition of many Machine Learning models; make accurate predictions; make powerful analyses; make robust Machine Learning models; create strong added value for your business; use Machine Learning for personal purposes; handle specific topics like Reinforcement Learning, NLP, and Deep Learning; handle advanced techniques like Dimensionality Reduction; know which Machine Learning model to choose for each type of problem; build an army of powerful Machine Learning models and know how to combine them to solve any problem.
World Model as a Graph: Learning Latent Landmarks for Planning
Zhang, Lunjun, Yang, Ge, Stadie, Bradly C.
Planning - the ability to analyze the structure of a problem in the large and decompose it into interrelated subproblems - is a hallmark of human intelligence. While deep reinforcement learning (RL) has shown great promise for solving relatively straightforward control tasks, it remains an open problem how to best incorporate planning into existing deep RL paradigms to handle increasingly complex environments. One prominent framework, Model-Based RL, learns a world model and plans using step-by-step virtual rollouts. This type of world model quickly diverges from reality when the planning horizon increases, thus struggling at long-horizon planning. How can we learn world models that endow agents with the ability to do temporally extended reasoning? In this work, we propose to learn graph-structured world models composed of sparse, multi-step transitions. We devise a novel algorithm to learn latent landmarks that are scattered (in terms of reachability) across the goal space as the nodes on the graph. In this same graph, the edges are the reachability estimates distilled from Q-functions. On a variety of high-dimensional continuous control tasks ranging from robotic manipulation to navigation, we demonstrate that our method, named L3P, significantly outperforms prior work, and is oftentimes the only method capable of leveraging both the robustness of model-free RL and generalization of graph-search algorithms. We believe our work is an important step towards scalable planning in reinforcement learning.
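To make the idea of planning over a graph of landmarks concrete, here is a minimal sketch that runs Dijkstra's algorithm on a small hand-written graph. It is not the L3P method itself: in the paper the landmarks are learned and the edge weights are distilled from Q-functions, whereas the node names and edge costs below are hypothetical numbers standing in for reachability estimates.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm over a dict-of-dicts graph with non-negative costs.

    graph[u][v] is the estimated cost (e.g., number of steps) to reach v from u.
    Returns (path, total_cost), or (None, inf) if the goal is unreachable.
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]


# Hypothetical landmark graph; weights stand in for reachability estimates.
graph = {
    "start": {"A": 2.0, "B": 5.0},
    "A": {"B": 1.0, "goal": 6.0},
    "B": {"goal": 2.0},
    "goal": {},
}

path, cost = shortest_path(graph, "start", "goal")
print(path, cost)
```

Each hop in the returned path is a multi-step subgoal that a low-level goal-conditioned policy would then be asked to reach, which is how sparse graph search can sidestep step-by-step rollouts over long horizons.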
Self-improving Chatbots based on Deep Reinforcement Learning
We present a Reinforcement Learning (RL) model for self-improving chatbots, specifically targeting FAQ-type chatbots. The model is not aimed at building a dialog system from scratch, but at leveraging data from user conversations to improve chatbot performance. At the core of our approach is a score model, which is trained to score chatbot utterance-response tuples based on user feedback. The scores predicted by this model are used as rewards for the RL agent. Policy learning takes place offline, thanks to a user simulator that is fed with utterances from the FAQ database.
REPAINT: Knowledge Transfer in Deep Actor-Critic Reinforcement Learning
Tao, Yunzhe, Genc, Sahika, Sun, Tao, Mallya, Sunil
Accelerating the learning process for complex tasks by leveraging previously learned tasks has been one of the most challenging problems in reinforcement learning, especially when the similarity between source and target tasks is low or unknown. In this work, we propose a REPresentation-And-INstance Transfer algorithm (REPAINT) for the deep actor-critic reinforcement learning paradigm. In representation transfer, we adopt a kickstarted training method using a pre-trained teacher policy by introducing an auxiliary cross-entropy loss. In instance transfer, we develop a sampling approach, i.e., advantage-based experience replay, on transitions collected following the teacher policy, where only the samples with high advantage estimates are retained for policy update. We consider both learning an unseen target task by transferring from previously learned teacher tasks and learning a partially unseen task composed of multiple sub-tasks by transferring from a pre-learned teacher sub-task. In several benchmark experiments, REPAINT significantly reduces the total training time and improves the asymptotic performance compared to training with no prior knowledge and other baselines.
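The two ingredients named in this abstract can be illustrated with a small sketch: a cross-entropy term between teacher and student action distributions, and a filter that keeps only transitions with high advantage estimates. The abstract does not give the exact loss weighting or selection rule, so the function names, the fixed `threshold` parameter, and the toy numbers below are assumptions, not the authors' implementation.

```python
import math


def kickstarting_loss(teacher_probs, student_probs):
    """Cross-entropy H(teacher, student) over a discrete action distribution.

    Terms where the teacher assigns zero probability contribute nothing.
    """
    return -sum(
        p_t * math.log(p_s)
        for p_t, p_s in zip(teacher_probs, student_probs)
        if p_t > 0.0
    )


def select_high_advantage(transitions, advantages, threshold):
    """Keep only transitions whose advantage estimate exceeds the threshold.

    The fixed threshold is a hypothetical stand-in for the selection rule.
    """
    return [t for t, adv in zip(transitions, advantages) if adv > threshold]


# Toy example: the teacher is certain about action 0, the student is undecided.
loss = kickstarting_loss([1.0, 0.0], [0.5, 0.5])
print(loss)

# Toy transitions collected with the teacher policy, with made-up advantages.
kept = select_high_advantage(["t1", "t2", "t3"], [0.5, -0.2, 1.3], threshold=0.4)
print(kept)
```

In a full actor-critic loop, the cross-entropy term would be added to the student's usual policy loss, and only the retained teacher transitions would feed the policy update.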
An analysis of Reinforcement Learning applied to Coach task in IEEE Very Small Size Soccer
Pena, Carlos H. C., Machado, Mateus G., Barros, Mariana S., Silva, José D. P., Maciel, Lucas D., Ren, Tsang Ing, Barros, Edna N. S., Braga, Pedro H. M., Bassani, Hansenclever F.
The IEEE Very Small Size Soccer (VSSS) is a robot soccer competition in which two teams of three small robots play against each other. Traditionally, a deterministic coach agent chooses the most suitable strategy and formation in response to each adversary's strategy. The role of the coach is therefore of great importance to the game. This paper proposes an end-to-end approach to the coaching task based on Reinforcement Learning (RL). The proposed system processes information during simulated matches to learn an optimal policy that chooses the current formation depending on the opponent and game conditions. We trained two RL policies against three different teams (balanced, offensive, and heavily offensive) in a simulated environment. Evaluated against one of the top teams of the VSSS league, our approach showed promising results, achieving a win/loss ratio of approximately 2.0.