Reinforcement Learning
Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes
Fruit, Ronan, Pirotta, Matteo, Lazaric, Alessandro
While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This leads to defining weakly-communicating or multi-chain MDPs. In this paper, we introduce \tucrl, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S^{\texttt{C}}$ communicating states, $A$ actions and $\Gamma^{\texttt{C}} \leq S^{\texttt{C}}$ possible communicating next states, we derive a $\widetilde{O}(D^{\texttt{C}} \sqrt{\Gamma^{\texttt{C}} S^{\texttt{C}} AT})$ regret bound, where $D^{\texttt{C}}$ is the diameter (i.e., the longest shortest path) of the communicating part of the MDP. This is in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that suffer linear regret in weakly-communicating MDPs, as well as posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge on the bias span of the optimal policy to bias the exploration to achieve sub-linear regret. We also prove that in weakly-communicating MDPs, no algorithm can ever achieve a logarithmic growth of the regret without first suffering a linear regret for a number of steps that is exponential in the parameters of the MDP. Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state-of-the-art.
Ranked Reward: Enabling Self-Play Reinforcement Learning for Combinatorial Optimization
Laterre, Alexandre, Fu, Yunguan, Jabri, Mohamed Khalil, Cohen, Alain-Sam, Kas, David, Hajjar, Karl, Dahl, Torbjorn S., Kerkeni, Amine, Beguir, Karim
Adversarial self-play in two-player games has delivered impressive results when used with reinforcement learning algorithms that combine deep neural networks and tree search. Algorithms like AlphaZero and Expert Iteration learn tabula-rasa, producing highly informative training data on the fly. However, the self-play training strategy is not directly applicable to single-player games. Recently, several practically important combinatorial optimization problems, such as the traveling salesman problem and the bin packing problem, have been reformulated as reinforcement learning problems, increasing the importance of enabling the benefits of self-play beyond two-player games. We present the Ranked Reward (R2) algorithm which accomplishes this by ranking the rewards obtained by a single agent over multiple games to create a relative performance metric. Results from applying the R2 algorithm to instances of a two-dimensional bin packing problem show that it outperforms generic Monte Carlo tree search, heuristic algorithms and reinforcement learning algorithms not using ranked rewards.
Variance Reduction for Reinforcement Learning in Input-Driven Environments
Mao, Hongzi, Venkatakrishnan, Shaileshh Bojja, Schwarzkopf, Malte, Alizadeh, Mohammad
We consider reinforcement learning in input-driven environments, where an exogenous, stochastic input process affects the dynamics of the system. Input processes arise in many applications, including queuing systems, robotics control with disturbances, and object tracking. Since the state dynamics and rewards depend on the input process, the state alone provides limited information for the expected future returns. Therefore, policy gradient methods with standard state-dependent baselines suffer high variance during training. We derive a bias-free, input-dependent baseline to reduce this variance, and analytically show its benefits over state-dependent baselines. We then propose a meta-learning approach to overcome the complexity of learning a baseline that depends on a long sequence of inputs. Our experimental results show that across environments from queuing systems, computer networks, and MuJoCo robotic locomotion, input-dependent baselines consistently improve training stability and result in better eventual policies.
Arcades: A deep model for adaptive decision making in voice controlled smart-home
Brenon, Alexis, Portet, Franรงois, Vacher, Michel
Smart-home is an application domain which brings together home automation and ambient intelligence to ease life of dwellers and to provide support to people in loss of autonomy. The development of smarthomes is not only a cultural and technological evolution but is also recognized as one way to address the challenges created by an aging population in developed countries [42]. If home automation is concerned with sensing (sensors, actuators, middle-ware) and low-level automation (heating control, lighting control), Ambient Intelligence should provide perception and reasoning capabilities into the smart-home ecosystem. However, although the development of smart-homes is supported by a large amount of research and industrial projects, it has not reached a large public since many challenges are still to be addressed. One of the main challenges is due to the complexity of setting up the smart-home system in case of new situations (devices, house, dwellers, after an accident, etc.).
How Many Random Seeds? Statistical Power Analysis in Deep Reinforcement Learning Experiments
Colas, Cรฉdric, Sigaud, Olivier, Oudeyer, Pierre-Yves
Consistently checking the statistical significance of experimental results is one of the mandatory methodological steps to address the so-called "reproducibility crisis" in deep reinforcement learning. In this tutorial paper, we explain how the number of random seeds relates to the probabilities of statistical errors. For both the t-test and the bootstrap confidence interval test, we recall theoretical guidelines to determine the number of random seeds one should use to provide a statistically significant comparison of the performance of two algorithms. Finally, we discuss the influence of deviations from the assumptions usually made by statistical tests. We show that they can lead to inaccurate evaluations of statistical errors and provide guidelines to counter these negative effects. We make our code available to perform the tests.
Goal-oriented Trajectories for Efficient Exploration
Pardo, Fabio, Levdik, Vitaly, Kormushev, Petar
Exploration is a difficult challenge in reinforcement learning and even recent state-of-the art curiosity-based methods rely on the simple epsilon-greedy strategy to generate novelty. We argue that pure random walks do not succeed to properly expand the exploration area in most environments and propose to replace single random action choices by random goals selection followed by several steps in their direction. This approach is compatible with any curiosity-based exploration and off-policy reinforcement learning agents and generates longer and safer trajectories than individual random actions. To illustrate this, we present a task-independent agent that learns to reach coordinates in screen frames and demonstrate its ability to explore with the game Super Mario Bros. improving significantly the score of a baseline DQN agent.
Deep Reinforcement Learning for Doom using Unsupervised Auxiliary Tasks
Papoudakis, Georgios, Chatzidimitriou, Kyriakos C., Mitkas, Pericles A.
Recent developments in deep reinforcement learning have enabled the creation of agents for solving a large variety of games given a visual input. These methods have been proven successful for 2D games, like the Atari games, or for simple tasks, like navigating in mazes. It is still an open question, how to address more complex environments, in which the reward is sparse and the state space is huge. In this paper we propose a divide and conquer deep reinforcement learning solution and we test our agent in the first person shooter (FPS) game of Doom. Our work is based on previous works in deep reinforcement learning and in Doom agents. We also present how our agent is able to perform better in unknown environments compared to a state of the art reinforcement learning algorithm.
r/MachineLearning - [R] Learning to Drive in a Day with Deep Reinforcement Learning
Abstract: We demonstrate the first application of deep reinforcement learning to autonomous driving. From randomly initialised parameters, our model is able to learn a policy for lane following in a handful of training episodes using a single monocular image as input. We provide a general and easy to obtain reward: the distance travelled by the vehicle without the safety driver taking control. We use a continuous, model-free deep reinforcement learning algorithm, with all exploration and optimisation performed on-vehicle. This demonstrates a new framework for autonomous driving which moves away from reliance on defined logical rules, mapping, and direct supervision.
Per-decision Multi-step Temporal Difference Learning with Control Variates
De Asis, Kristopher, Sutton, Richard S.
Multi-step temporal difference (TD) learning is an important approach in reinforcement learning, as it unifies one-step TD learning with Monte Carlo methods in a way where intermediate algorithms can outperform either extreme. They address a bias-variance trade off between reliance on current estimates, which could be poor, and incorporating longer sampled reward sequences into the updates. Especially in the off-policy setting, where the agent aims to learn about a policy different from the one generating its behaviour, the variance in the updates can cause learning to diverge as the number of sampled rewards used in the estimates increases. In this paper, we introduce per-decision control variates for multi-step TD algorithms, and compare them to existing methods. Our results show that including the control variates can greatly improve performance on both on and off-policy multi-step temporal difference learning tasks.
Supervised Reinforcement Learning with Recurrent Neural Network for Dynamic Treatment Recommendation
Wang, Lu, Zhang, Wei, He, Xiaofeng, Zha, Hongyuan
Dynamic treatment recommendation systems based on large-scale electronic health records (EHRs) become a key to successfully improve practical clinical outcomes. Prior relevant studies recommend treatments either use supervised learning (e.g. matching the indicator signal which denotes doctor prescriptions), or reinforcement learning (e.g. maximizing evaluation signal which indicates cumulative reward from survival rates). However, none of these studies have considered to combine the benefits of supervised learning and reinforcement learning. In this paper, we propose Supervised Reinforcement Learning with Recurrent Neural Network (SRL-RNN), which fuses them into a synergistic learning framework. Specifically, SRL-RNN applies an off-policy actor-critic framework to handle complex relations among multiple medications, diseases and individual characteristics. The "actor" in the framework is adjusted by both the indicator signal and evaluation signal to ensure effective prescription and low mortality. RNN is further utilized to solve the Partially-Observed Markov Decision Process (POMDP) problem due to the lack of fully observed states in real world applications. Experiments on the publicly real-world dataset, i.e., MIMIC-3, illustrate that our model can reduce the estimated mortality, while providing promising accuracy in matching doctors' prescriptions.