Reinforcement Learning
Benchmarking Reinforcement Learning Algorithms on Real-World Robots
Mahmood, A. Rupam, Korenkevych, Dmytro, Vasan, Gautham, Ma, William, Bergstra, James
Through many recent successes in simulation, model-free reinforcement learning has emerged as a promising approach to solving continuous control robotic tasks. The research community is now able to reproduce, analyze and build quickly on these results due to open source implementations of learning algorithms and simulated benchmark tasks. To carry forward these successes to real-world applications, it is crucial to withhold utilizing the unique advantages of simulations that do not transfer to the real world and experiment directly with physical robots. However, reinforcement learning research with physical robots faces substantial resistance due to the lack of benchmark tasks and supporting source code. In this work, we introduce several reinforcement learning tasks with multiple commercially available robots that present varying levels of learning difficulty, setup, and repeatability. On these tasks, we test the learning performance of off-the-shelf implementations of four reinforcement learning algorithms and analyze sensitivity to their hyper-parameters to determine their readiness for applications in various real-world tasks. Our results show that with a careful setup of the task interface and computations, some of these implementations can be readily applicable to physical robots. We find that state-of-the-art learning algorithms are highly sensitive to their hyper-parameters and their relative ordering does not transfer across tasks, indicating the necessity of re-tuning them for each task for best performance. On the other hand, the best hyper-parameter configuration from one task may often result in effective learning on held-out tasks even with different robots, providing a reasonable default. We make the benchmark tasks publicly available to enhance reproducibility in real-world reinforcement learning.
TStarBots: Defeating the Cheating Level Builtin AI in StarCraft II in the Full Game
Sun, Peng, Sun, Xinghai, Han, Lei, Xiong, Jiechao, Wang, Qing, Li, Bo, Zheng, Yang, Liu, Ji, Liu, Yongsheng, Liu, Han, Zhang, Tong
Starcraft II (SCII) is widely considered as the most challenging Real Time Strategy (RTS) game as of now, due to large observation space, huge (continuous and infinite) action space, partial observation, multi-player simultaneous game model, long time horizon decision, etc. To push the frontier of AI's capability, Deepmind and Blizzard jointly present the StarCraft II Learning Environment (SC2LE) --- a testbench for designing complex decision making systems. While SC2LE provides a few mini games such as \textit{MoveToBeacon}, \textit{CollectMineralShards}, and \textit{DefeatRoaches} where some AI agents achieve the professional player's level, it is still far away from achieving the professional level in a \emph{full} game. To initialize the research and investigation in the full game, we develop two AI agents --- the AI agent TStarBot1 is based on deep reinforcement learning over flat action structure, and the AI agent TStarBot2 is based on rule controller over hierarchical action structure. Both TStarBot1 and TStarBot2 are able to defeat the builtin AI agents from level 1 to level 10 in a full game (1v1 \textit{Zerg}-vs-\textit{Zerg} game on the AbyssalReef map), noting that level 8, level 9, and level 10 are cheating agents with full vision on the whole map, with resource harvest boosting, and with both, respectively \footnote{According to some informal discussions from the StarCraft II forum, level 10 builtin AI is estimated to be Platinum to Diamond~\cite{scii-forum}, which are equivalent to top 50\% - 30\% human players in the ranking system of Battle.net Leagues~\cite{liquid}. }.
Predicting Periodicity with Temporal Difference Learning
De Asis, Kristopher, Bennett, Brendan, Sutton, Richard S.
Temporal difference (TD) learning is an important approach in reinforcement learning, as it combines ideas from dynamic programming and Monte Carlo methods in a way that allows for online and incremental model-free learning. A key idea of TD learning is that it is learning predictive knowledge about the environment in the form of value functions, from which it can derive its behavior to address long-term sequential decision making problems. The agent's horizon of interest, that is, how immediate or long-term a TD learning agent predicts into the future, is adjusted through a discount rate parameter. In this paper, we introduce an alternative view on the discount rate, with insight from digital signal processing, to include complex-valued discounting. Our results show that setting the discount rate to appropriately chosen complex numbers allows for online and incremental estimation of the Discrete Fourier Transform (DFT) of a signal of interest with TD learning. We thereby extend the types of knowledge representable by value functions, which we show are particularly useful for identifying periodic effects in the reward sequence.
Prosocial or Selfish? Agents with different behaviors for Contract Negotiation using Reinforcement Learning
Sunder, Vishal, Vig, Lovekesh, Chatterjee, Arnab, Shroff, Gautam
We present an effective technique for training deep learning agents capable of negotiating on a set of clauses in a contract agreement using a simple communication protocol. We use Multi Agent Reinforcement Learning to train both agents simultaneously as they negotiate with each other in the training environment. We also model selfish and prosocial behavior to varying degrees in these agents. Empirical evidence is provided showing consistency in agent behaviors. We further train a meta agent with a mixture of behaviors by learning an ensemble of different models using reinforcement learning. Finally, to ascertain the deployability of the negotiating agents, we conducted experiments pitting the trained agents against human players. Results demonstrate that the agents are able to hold their own against human players, often emerging as winners in the negotiation. Our experiments demonstrate that the meta agent is able to reasonably emulate human behavior.
Deterministic Implementations for Reproducibility in Deep Reinforcement Learning
Nagarajan, Prabhat, Warnell, Garrett, Stone, Peter
While deep reinforcement learning (DRL) has led to numerous successes in recent years, reproducing these successes can be extremely challenging. One reproducibility challenge particularly relevant to DRL is nondeterminism in the training process, which can substantially affect the results. Motivated by this challenge, we study the positive impacts of deterministic implementations in eliminating nondeterminism in training. To do so, we consider the particular case of the deep Q-learning algorithm, for which we produce a deterministic implementation by identifying and controlling all sources of nondeterminism in the training process. One by one, we then allow individual sources of nondeterminism to affect our otherwise deterministic implementation, and measure the impact of each source on the variance in performance. We find that individual sources of nondeterminism can substantially impact the performance of agent, illustrating the benefits of deterministic implementations. In addition, we also discuss the important role of deterministic implementations in achieving exact replicability of results.
Reinforcement Learning Series Intro - Syllabus Overview
Welcome to this series on reinforcement learning! We'll first start out by introducing the absolute basics to build a solid ground for us to run. We'll then progress onto more advanced and sophisticated topics that integrate artificial neural networks and deep learning into reinforcement learning. We'll also be getting our hands dirty by implementing some super cool reinforcement learning projects in code! Without further ado, let's get to it!
Model-Free Adaptive Optimal Control of Sequential Manufacturing Processes using Reinforcement Learning
Dornheim, Johannes, Link, Norbert, Gumbsch, Peter
A self-learning optimal control algorithm for sequential manufacturing processes with time-discrete control actions is proposed and evaluated with simulated deep drawing processes. The necessary control model is built during consecutive process executions under optimal control via Reinforcement Learning, using the measured product quality as reward after each process execution. Prior model formation, which is required by state-of-the-art algorithms like Model Predictive Control and Approximate Dynamic Programming, is therefore obsolete. This avoids the difficulties in system identification and accurate modelling, which arise with processes subject to non-linear dynamics and stochastic influences. Also runtime complexity problems of these approaches are avoided, which arise when more complex models and larger control prediction horizons are employed. Instead of using pre-created process- and observation-models, Reinforcement Learning algorithms build functions of expected future reward during processing, which are then used for optimal process control decisions. The learning of such expectation functions is realized online by interacting with the process. The proposed algorithm also takes stochastic variations of the process conditions into consideration and is able to cope with partial observability. A method for the adaptive optimal control of partially observable fixed-horizon manufacturing processes, based on Q-learning is developed and studied. The resulting algorithm is instantiated and then evaluated by application to a time-stochastic optimal control problem in metal sheet deep drawing, where the experiments use FEM-simulated processes. The Reinforcement Learning based control shows superior results over the model-based Model Predictive Control and Approximate Dynamic Programming approaches.
Interpretable Reinforcement Learning with Ensemble Methods
Brown, Alexander, Petrik, Marek
Reinforcement learning continues to break bounds on what we even thought possible, recently with AlphaGo's triumph over leading Go player Lee Sedol and with the further successes of AlphaGoZero, which surpassed AlphaGo learning only from self-play [14]. While the performance of such systems is impressive and very useful, sometimes it is desirable to understand and interpret the actions of a reinforcement learning system, and machine learning systems in general. These circumstances are more common in high-pressure applications, such as healthcare, targeted advertising, or finance [6]. For example, researchers at the University of Pittsburgh Medical Center trained a variety of machine learning models including neural networks and decision trees to predict whether pneumonia patients might develop severe complications. The neural networks performed the best on their testing data, but upon examination of the rules of the decision trees, the researchers found that the trees recommended sending pneumonia patients who had asthma directly home, despite the fact that asthma makes patients with pneumonia much more likely to suffer complications.
Leveraging Contact Forces for Learning to Grasp
Merzic, Hamza, Bogdanovic, Miroslav, Kappler, Daniel, Righetti, Ludovic, Bohg, Jeannette
Grasping objects under uncertainty remains an open problem in robotics research. This uncertainty is often due to noisy or partial observations of the object pose or shape. To enable a robot to react appropriately to unforeseen effects, it is crucial that it continuously takes sensor feedback into account. While visual feedback is important for inferring a grasp pose and reaching for an object, contact feedback offers valuable information during manipulation and grasp acquisition. In this paper, we use model-free deep reinforcement learning to synthesize control policies that exploit contact sensing to generate robust grasping under uncertainty. We demonstrate our approach on a multi-fingered hand that exhibits more complex finger coordination than the commonly used two-fingered grippers. We conduct extensive experiments in order to assess the performance of the learned policies, with and without contact sensing. While it is possible to learn grasping policies without contact sensing, our results suggest that contact feedback allows for a significant improvement of grasping robustness under object pose uncertainty and for objects with a complex shape.
SCC-rFMQ Learning in Cooperative Markov Games with Continuous Actions
Zhang, Chengwei, Li, Xiaohong, Hao, Jianye, Chen, Siqi, Tuyls, Karl, Feng, Zhiyong, Xue, Wanli, Chen, Rong
Although many reinforcement learning methods have been proposed for learning the optimal solutions in single-agent continuousaction domains, multiagent coordination domains with continuous actions have received relatively few investigations. In this paper, we propose an independent learner hierarchical method, named Sample Continuous Coordination with recursive Frequency Maximum Q-Value (SCC-rFMQ), which divides the cooperative problem with continuous actions into two layers. The first layer samples a finite set of actions from the continuous action spaces by a re-sampling mechanism with variable exploratory rates, and the second layer evaluates the actions in the sampled action set and updates the policy using a reinforcement learning cooperative method. By constructing cooperative mechanisms at both levels, SCC-rFMQ can handle cooperative problems in continuous action cooperative Markov games effectively. The effectiveness of SCC-rFMQ is experimentally demonstrated on two well-designed games, i.e., a continuous version of the climbing game and a cooperative version of the boat problem. Experimental results show that SCC-rFMQ outperforms other reinforcement learning algorithms. A large number of multiagent coordination domains involve continuous action spaces, such as robot soccer [1] and multiplayer online battle arena game [2]. In such environments, agents not only need to coordinate with other agents towards desirable outcomes efficiently but also have to deal with infinitely large action spaces.