Reinforcement Learning
DL2: A Deep Learning-driven Scheduler for Deep Learning Clusters
Peng, Yanghua, Bao, Yixin, Chen, Yangrui, Wu, Chuan, Meng, Chen, Lin, Wei
More and more companies have deployed machine learning (ML) clusters, where deep learning (DL) models are trained for providing various AI-driven services. Efficient resource scheduling is essential for maximal utilization of expensive DL clusters. Existing cluster schedulers are either agnostic to ML workload characteristics, or use scheduling heuristics based on operators' understanding of particular ML frameworks and workloads, which are less efficient or not general enough. In this paper, we show that DL techniques can be adopted to design a generic and efficient scheduler. DL2 is a DL-driven scheduler for DL clusters, targeting global training job expedition by dynamically resizing resources allocated to jobs. DL2 advocates a joint supervised learning and reinforcement learning approach: a neural network is warmed up via offline supervised learning based on job traces produced by the existing cluster scheduler; then the neural network is plugged into the live DL cluster, fine-tuned by reinforcement learning carried out throughout the training progress of the DL jobs, and used for deciding job resource allocation in an online fashion. By applying past decisions made by the existing cluster scheduler in the preparatory supervised learning phase, our approach enables a smooth transition from the existing scheduler, and yields a high-quality scheduler for minimizing average training completion time. We implement DL2 on Kubernetes and enable dynamic resource scaling in DL jobs on MXNet. Extensive evaluation shows that DL2 outperforms a fairness scheduler (i.e., DRF) by 44.1% and an expert heuristic scheduler (i.e., Optimus) by 17.5% in terms of average job completion time.
Towards an Adaptive Robot for Sports and Rehabilitation Coaching
Ross, Martin K., Broz, Frank, Baillie, Lynne
The work presented in this paper aims to explore how, and to what extent, an adaptive robotic coach has the potential to provide extra motivation to adhere to long-term rehabilitation and help fill the coaching gap which occurs during repetitive solo practice in high performance sport. Adapting the behavior of a social robot to a specific user, using reinforcement learning (RL), could be a way of increasing adherence to an exercise routine in both domains. The requirements gathering phase is underway and is presented in this paper along with the rationale of using RL in this context.
Flight Controller Synthesis Via Deep Reinforcement Learning
Traditional control methods are inadequate in many deployment settings involving control of Cyber-Physical Systems (CPS). In such settings, CPS controllers must operate and respond to unpredictable interactions, conditions, or failure modes. Dealing with such unpredictability requires the use of executive and cognitive control functions that allow for planning and reasoning. Motivated by the sport of drone racing, this dissertation addresses these concerns for state-of-the-art flight control by investigating the use of deep neural networks to bring essential elements of higher-level cognition to the construction of low-level flight controllers. This thesis reports on the development and release of an open source, full solution stack for building neuro-flight controllers. This stack consists of the methodology for constructing a multicopter digital twin for synthesizing the flight controller unique to a specific aircraft, a tuning framework for implementing training environments (GymFC), and firmware for the world's first neural network supported flight controller (Neuroflight). GymFC's novel approach brings the digital twinning paradigm into flight control training to provide seamless transfer to hardware. Additionally, this thesis examines alternative reward system functions as well as changes to the software environment to bridge the gap between the simulation and real world deployment environments. Work summarized in this thesis demonstrates that reinforcement learning can be leveraged for training neural network controllers capable not only of maintaining stable flight, but also of precision aerobatic maneuvers in real world settings. As such, this work provides a foundation for developing the next generation of flight control systems.
Petri Net Machines for Human-Agent Interaction
Dondrup, Christian, Papaioannou, Ioannis, Lemon, Oliver
Smart speakers and robots are becoming ever more prevalent in our daily lives. These agents are able to execute a wide range of tasks and actions and, therefore, need systems to control their execution. Current state-of-the-art approaches such as (deep) reinforcement learning, however, require vast amounts of training data, which is often hard to come by when interacting with humans. To overcome this issue, most systems still rely on Finite State Machines. We introduce Petri Net Machines, which present a formal definition for state machines based on Petri Nets that are able to execute concurrent actions reliably, execute and interleave several plans at the same time, and provide an easy-to-use modelling language. We show their workings based on the example of Human-Robot Interaction in a shopping mall.
DataWorkshop Club Conf 2019 Machine Learning Conference Online
Recent years have seen a rising interest in developing AI algorithms for real world big data domains ranging from autonomous cars to personalized assistants. At the core of these algorithms are architectures that combine deep neural networks, for approximating the underlying multidimensional state-spaces, with reinforcement learning, for controlling agents that learn to operate in said state-spaces towards achieving a given objective. The talk will first outline notable past and future efforts in deep reinforcement learning as well as identify fundamental problems that this technology has been struggling to overcome. Towards mitigating these problems (and opening up an alternative path to general artificial intelligence), I will then summarize a brain computing model of intelligence, rooted in the latest findings in neuroscience. The talk will conclude with an overview of the recent research efforts in the field of multi-agent systems, to provide the future teams of humans and agents with the necessary tools that allow them to safely co-exist.
Reinforcement Learning for Portfolio Management
Traditionally, mathematical formulations of dynamical systems in the context of Signal Processing and Control Theory have been a lynchpin of today's Financial Engineering. More recently, advances in sequential decision making, mainly through the concept of Reinforcement Learning, have been instrumental in the development of multistage stochastic optimization, a key component in sequential portfolio optimization (asset allocation) strategies. In this thesis, we develop a comprehensive account of the expressive power, modelling efficiency, and performance advantages of so-called trading agents (i.e., Deep Soft Recurrent Q-Network (DSRQN) and Mixture of Score Machines (MSM)), based on both traditional system identification (model-based approach) as well as on context-independent agents (model-free approach). The analysis provides conclusive support for the ability of model-free reinforcement learning methods to act as universal trading agents, which are not only capable of reducing the computational and memory complexity (owing to their linear scaling with the size of the universe), but also serve as generalizing strategies across assets and markets, regardless of the trading universe on which they have been trained. The relatively low volume of daily returns in financial market data is addressed via data augmentation (a generative approach) and a choice of pre-training strategies, both of which are validated against current state-of-the-art models. For rigour, a risk-sensitive framework which includes transaction costs is considered, and its performance advantages are demonstrated in a variety of scenarios, from synthetic time-series (sinusoidal, sawtooth and chirp waves), simulated market series (surrogate data based), through to real market data (S&P 500 and EURO STOXX 50).
The analysis and simulations confirm the superiority of universal model-free reinforcement learning agents over current portfolio management models in asset allocation strategies, with an achieved performance advantage of as much as 9.2% in annualized cumulative returns and 13.4% in annualized Sharpe Ratio.
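For readers unfamiliar with the Sharpe Ratio metric quoted above, the sketch below shows one conventional way to annualize it from daily returns. The function name, the 252-trading-day year, the zero default risk-free rate, and the use of the sample standard deviation are assumptions made for illustration; the thesis may compute the metric differently.

```python
import math
from statistics import mean, stdev


def annualized_sharpe(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of daily returns.

    Assumptions (not taken from the thesis): 252 trading days per year,
    a constant daily risk-free rate, and the sample standard deviation.
    """
    # Excess return of each day over the risk-free rate.
    excess = [r - risk_free_daily for r in daily_returns]
    volatility = stdev(excess)
    if volatility == 0:
        raise ValueError("Sharpe ratio is undefined for constant returns")
    # Scale the per-day ratio by sqrt(periods) to annualize it.
    return mean(excess) / volatility * math.sqrt(periods_per_year)


if __name__ == "__main__":
    # Hypothetical daily returns, for demonstration only.
    print(annualized_sharpe([0.01, -0.01, 0.02, 0.0]))
```

Because the ratio divides mean excess return by volatility, scaling every return by the same positive factor leaves it unchanged when the risk-free rate is zero, which is why it is commonly reported alongside raw cumulative returns.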
Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes
Kallus, Nathan, Uehara, Masatoshi
Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian, time-invariant, and ergodic structure in efficient OPE. We first derive the efficiency limits for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. In ergodic time-invariant Markov decision processes, however, our bounds show that truly off-policy evaluation is feasible, even with just one dependent trajectory, and provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE. Our DRL estimator simultaneously uses estimated stationary density ratios and $q$-functions; it remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.
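To make the combination of density ratios and $q$-functions concrete, a doubly robust estimator of this kind is commonly written in the form below. This is a sketch under assumptions not stated in the abstract: a discounted setting with discount factor $\gamma$, an initial-state distribution $p_0$, $n$ observed transitions $(s_i, a_i, r_i, s'_i)$, a target policy $\pi_e$, an estimated ratio $\hat{w}$ of the target policy's discounted state-action occupancy to the data distribution, and a policy value normalized by $(1-\gamma)$. The paper's own notation and normalization may differ.

```latex
\hat{\rho}_{\mathrm{DRL}}
  = (1-\gamma)\,\mathbb{E}_{s_0 \sim p_0}\!\left[\hat{v}(s_0)\right]
  + \frac{1}{n}\sum_{i=1}^{n} \hat{w}(s_i,a_i)\,
    \bigl(r_i + \gamma\,\hat{v}(s'_i) - \hat{q}(s_i,a_i)\bigr),
\qquad
\hat{v}(s) = \sum_{a} \pi_e(a \mid s)\,\hat{q}(s,a).
```

If $\hat{q}$ is correct, the weighted correction term has mean zero by the Bellman equation and the first term alone gives the value. If $\hat{w}$ is correct, the terms involving $\hat{q}$ cancel in expectation and the weighted rewards give the value. This is the sense in which the estimate stays consistent when either component is estimated consistently.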
Reinforcement Learning: a Comparison of UCB Versus Alternative Adaptive Policies
Cowan, Wesley, Katehakis, Michael N., Pirutinsky, Daniel
In this paper we consider the basic version of Reinforcement Learning (RL) that involves computing optimal data-driven (adaptive) policies for Markovian decision processes with unknown transition probabilities. We provide a brief survey of the state of the art of the area and we compare the performance of the classic UCB policy of \cc{bkmdp97} with a new policy developed herein, which we call MDP-Deterministic Minimum Empirical Divergence (MDP-DMED), and a method based on posterior sampling (MDP-PS).
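As background on the upper-confidence-bound idea, the sketch below implements UCB1 for a multi-armed bandit. This is a simpler, well-known relative and not the MDP policy compared in the paper; the function name `ucb1`, the `pull` callback, and the reward values in the demo are hypothetical and chosen only for illustration.

```python
import math


def ucb1(pull, n_arms, horizon):
    """Run UCB1 on a bandit and return how often each arm was pulled.

    `pull(arm)` is a caller-supplied function returning a reward in [0, 1].
    This is the bandit algorithm UCB1, not the MDP policy from the paper.
    """
    counts = [0] * n_arms    # number of pulls per arm
    totals = [0.0] * n_arms  # summed reward per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Pull every arm once so each empirical mean is defined.
            arm = t - 1
        else:
            # Choose the arm with the largest mean plus exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        totals[arm] += pull(arm)
        counts[arm] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical deterministic rewards, for demonstration only.
    rewards = [0.2, 0.5, 0.8]
    print(ucb1(lambda a: rewards[a], n_arms=3, horizon=500))
```

The exploration bonus shrinks as an arm is pulled more often, so pulls concentrate on the arm with the highest empirical mean while the others are still revisited occasionally.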