Reinforcement Learning
Reinforcement Learning under Model Mismatch
Roy, Aurko, Xu, Huan, Pokutta, Sebastian
Reinforcement learning is concerned with learning a good policy for sequential decision making problems modeled as a Markov Decision Process (MDP), via interacting with the environment [22, 20]. In this work we address the problem of reinforcement learning from a misspecified model. As a motivating example, consider the scenario where the problem of interest is not directly accessible, but instead the agent can interact with a simulator whose dynamics is reasonably close to the true problem. Another plausible application is when the parameters of the model may evolve over time but can still be reasonably approximated by an MDP. To address this problem we use the framework of robust MDPs which was proposed by [2, 17, 13] to solve the planning problem under model misspecification. The robust MDP framework considers a class of models and finds the robust optimal policy which is a policy that performs best under the worst model. It was shown by [2, 17, 13] that the robust optimal policy satisfies the robust Bellman equation which naturally leads to exact dynamic programming algorithms to find an optimal policy. However, this approach is model dependent and does not immediately generalize to the model-free case where the parameters of the model are unknown. Essentially, reinforcement learning is a model-free framework to solve the Bellman equation using samples.
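For orientation, one common reward-maximizing form of the robust Bellman equation mentioned above can be written as follows. This is a sketch under assumptions, not necessarily the notation of the cited works: the symbols $v^*$, $r$, $\gamma$, $\mathcal{A}$ and the per-state-action uncertainty set $\mathcal{P}_{s,a}$ (a "rectangular" set of candidate transition distributions) are introduced here for illustration only.

```latex
v^*(s) \;=\; \max_{a \in \mathcal{A}} \; \min_{p \in \mathcal{P}_{s,a}}
  \Big[\, r(s,a) + \gamma \sum_{s'} p(s')\, v^*(s') \,\Big]
```

When each $\mathcal{P}_{s,a}$ contains a single distribution, the inner minimum disappears and this reduces to the ordinary Bellman optimality equation.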
Boltzmann Exploration Done Right
Cesa-Bianchi, Nicolò, Gentile, Claudio, Lugosi, Gábor, Neu, Gergely
Boltzmann exploration is a classic strategy for sequential decision-making under uncertainty, and is one of the most standard tools in Reinforcement Learning (RL). Despite its widespread use, there is virtually no theoretical understanding about the limitations or the actual benefits of this exploration scheme. Does it drive exploration in a meaningful way? Is it prone to misidentifying the optimal actions or spending too much time exploring the suboptimal ones? What is the right tuning for the learning rate? In this paper, we address several of these questions in the classic setup of stochastic multi-armed bandits. One of our main results is showing that the Boltzmann exploration strategy with any monotone learning-rate sequence will induce suboptimal behavior. As a remedy, we offer a simple non-monotone schedule that guarantees near-optimal performance, albeit only when given prior access to key problem parameters that are typically not available in practical situations (like the time horizon $T$ and the suboptimality gap $\Delta$). More importantly, we propose a novel variant that uses different learning rates for different arms, and achieves a distribution-dependent regret bound of order $\frac{K\log^2 T}{\Delta}$ and a distribution-independent bound of order $\sqrt{KT}\log K$ without requiring such prior knowledge. To demonstrate the flexibility of our technique, we also propose a variant that guarantees the same performance bounds even if the rewards are heavy-tailed.
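As a point of reference, the textbook form of Boltzmann (softmax) exploration on a stochastic bandit might be sketched as below. This is the plain single-learning-rate scheme, not the per-arm variant the abstract proposes; the function names, the Bernoulli reward model, and the `eta_schedule` callback are all assumptions made for illustration.

```python
import math
import random


def boltzmann_probs(means, eta):
    """Softmax over empirical mean rewards, scaled by learning rate eta."""
    top = max(means)  # subtract the max for numerical stability
    weights = [math.exp(eta * (m - top)) for m in means]
    total = sum(weights)
    return [w / total for w in weights]


def run_bandit(true_means, horizon, eta_schedule, seed=0):
    """Play a Bernoulli bandit (assumed reward model) for `horizon` rounds.

    eta_schedule(t) returns the learning rate used at round t (1-indexed).
    Returns the number of pulls of each arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        # Unpulled arms get an empirical mean of 0.0 (an arbitrary choice).
        means = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(k)]
        probs = boltzmann_probs(means, eta_schedule(t))
        arm = rng.choices(range(k), weights=probs)[0]
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

For example, `run_bandit([0.2, 0.8], 1000, lambda t: math.log(t + 1))` uses a monotone increasing schedule, the kind of schedule the abstract argues leads to suboptimal behavior.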
UCB Exploration via Q-Ensembles
Chen, Richard Y., Sidor, Szymon, Abbeel, Pieter, Schulman, John
We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well established algorithms from the bandit setting, and adapt them to the $Q$-learning setting. We propose an exploration strategy based on upper-confidence bounds (UCB). Our experiments show significant gains on the Atari benchmark.
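One plausible way to turn an ensemble of $Q$-estimates into a UCB-style action rule is to add a multiple of the ensemble's spread to its mean and act greedily on the result. The sketch below assumes exactly that; the name `ucb_action`, the coefficient `lam`, and the use of the population standard deviation are illustrative choices rather than details confirmed by the abstract.

```python
import statistics


def ucb_action(q_values, lam=0.1):
    """Pick the action maximizing ensemble mean + lam * ensemble std.

    q_values[k][a] is ensemble member k's Q-estimate for action a
    in the current state (hypothetical layout).
    """
    n_actions = len(q_values[0])
    scores = []
    for a in range(n_actions):
        estimates = [head[a] for head in q_values]
        bonus = lam * statistics.pstdev(estimates)  # disagreement as uncertainty
        scores.append(statistics.mean(estimates) + bonus)
    # Ties are broken in favor of the lowest action index.
    return max(range(n_actions), key=scores.__getitem__)
```

Actions on which the ensemble members disagree receive a larger bonus, so they are tried more often until the estimates converge.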
Machine Learning Scientist (Distributed Systems, Tensorflow) - Cambridge - November-04-2017 (FcARx)
We are currently seeking a hands-on Machine Learning Scientist (Distributed Systems, Tensorflow) for our new research-led startup, focussing on the application of artificial intelligence in the real world; particularly smart city simulations and bots. We're looking for a hardcore Machine Learning Scientist/Engineer who wants to work with the latest technology in multi-agent learning algorithms, Gaussian processes and reinforcement learning. As a Machine Learning Scientist/Engineer, you will be a core member of the machine learning team, working closely with the Machine Learning researchers and transforming their algorithmic research into highly innovative products which will be attractive and accessible to the world. Key Skills: Machine Learning Engineer/ML Scientist, Tensorflow, C, C++, Java, Python, C#, Distributed Algorithms, Distributed Systems, BSc, MSc, MPhil, PhD, Post-Doc, Research, R&D, startup, Multithreading.
Learning to Mix n-Step Returns: Generalizing lambda-Returns for Deep Reinforcement Learning
Sharma, Sahil, J, Girish Raguvir, Ramesh, Srivatsan, Ravindran, Balaraman
Reinforcement Learning (RL) can model complex behavior policies for goal-directed sequential decision making tasks. A hallmark of RL algorithms is Temporal Difference (TD) learning: value function for the current state is moved towards a bootstrapped target that is estimated using next state's value function. $\lambda$-returns generalize beyond 1-step returns and strike a balance between Monte Carlo and TD learning methods. While lambda-returns have been extensively studied in RL, they haven't been explored a lot in Deep RL. This paper's first contribution is an exhaustive benchmarking of lambda-returns. Although mathematically tractable, the use of exponentially decaying weighting of n-step returns based targets in lambda-returns is a rather ad-hoc design choice. Our second major contribution is that we propose a generalization of lambda-returns called Confidence-based Autodidactic Returns (CAR), wherein the RL agent learns the weighting of the n-step returns in an end-to-end manner. This allows the agent to learn to decide how much it wants to weigh the n-step returns based targets. In contrast, lambda-returns restrict RL agents to use an exponentially decaying weighting scheme. Autodidactic returns can be used for improving any RL algorithm which uses TD learning. We empirically demonstrate that using sophisticated weighted mixtures of multi-step returns (like CAR and lambda-returns) considerably outperforms the use of n-step returns. We perform our experiments on the Asynchronous Advantage Actor Critic (A3C) algorithm in the Atari 2600 domain.
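To make the weighting idea concrete, the sketch below computes the n-step returns from one starting state and mixes them with a weight vector. With the exponentially decaying weights from `lambda_weights` this gives a truncated $\lambda$-return; passing any other weights that sum to one gives a generic weighted mixture, which is the quantity CAR is described as learning. All function and argument names are assumptions for illustration, and the learned-weight network itself is not shown.

```python
def n_step_returns(rewards, next_values, gamma):
    """Return [G^(1), ..., G^(N)] from the starting state.

    rewards[i] is the reward of step i; next_values[n-1] is the bootstrap
    value V(s_n) of the state reached after n steps.
    """
    returns = []
    discounted_sum = 0.0
    for n, (r, v) in enumerate(zip(rewards, next_values), start=1):
        discounted_sum += gamma ** (n - 1) * r
        returns.append(discounted_sum + gamma ** n * v)
    return returns


def lambda_weights(n, lam):
    """Exponentially decaying weights for a lambda-return truncated at n steps."""
    weights = [(1 - lam) * lam ** i for i in range(n - 1)]
    weights.append(lam ** (n - 1))  # remaining mass goes to the longest return
    return weights


def mix_returns(returns, weights):
    """Weighted mixture of n-step returns; weights are expected to sum to 1."""
    return sum(w * g for w, g in zip(weights, returns))
```

Setting `lam=0` recovers the 1-step TD target and `lam=1` recovers the longest available n-step return.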
Double Q($\sigma$) and Q($\sigma, \lambda$): Unifying Reinforcement Learning Control Algorithms
Temporal-difference (TD) learning is an important field in reinforcement learning. Sarsa and Q-Learning are among the most used TD algorithms. The Q($\sigma$) algorithm (Sutton and Barto (2017)) unifies both. This paper extends the Q($\sigma$) algorithm to an online multi-step algorithm Q($\sigma, \lambda$) using eligibility traces and introduces Double Q($\sigma$) as the extension of Q($\sigma$) to double learning. Experiments suggest that the new Q($\sigma, \lambda$) algorithm can outperform the classical TD control methods Sarsa($\lambda$), Q($\lambda$) and Q($\sigma$).
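The unification can be illustrated with the one-step Q($\sigma$) target, which interpolates between a sampled (Sarsa-style) backup and an expected backup under the target policy. The sketch below covers only this one-step target, not the eligibility-trace or double-learning extensions; the function and argument names are assumptions.

```python
def q_sigma_target(reward, gamma, q_next, pi_next, a_next, sigma):
    """One-step Q(sigma) target.

    q_next[a]  : current estimate Q(s', a) for each action a
    pi_next[a] : target-policy probability pi(a | s')
    a_next     : the action actually sampled in s'
    sigma      : 1.0 gives the Sarsa target, 0.0 the Expected Sarsa target
    """
    sampled = q_next[a_next]
    expected = sum(p * q for p, q in zip(pi_next, q_next))
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)
```

With `sigma=0.0` and a greedy `pi_next`, the expected backup equals the maximum over actions, so the target coincides with the Q-learning target.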
A Deep Reinforcement Learning Chatbot
Serban, Iulian V., Sankar, Chinnadhurai, Germain, Mathieu, Zhang, Saizheng, Lin, Zhouhan, Subramanian, Sandeep, Kim, Taesup, Pieper, Michael, Chandar, Sarath, Ke, Nan Rosemary, Rajeshwar, Sai, de Brebisson, Alexandre, Sotelo, Jose M. R., Suhubdy, Dendi, Michalski, Vincent, Nguyen, Alexandre, Pineau, Joelle, Bengio, Yoshua
We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including template-based models, bag-of-words models, sequence-to-sequence neural network and latent variable neural network models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than many competing systems. Due to its machine learning architecture, the system is likely to improve with additional data.
Deep Reinforcement Learning: From Toys to Enterprise
Reinforcement learning is an increasingly popular machine learning technique that is particularly well suited for addressing problems within dynamic and adaptive environments. When paired with simulations, reinforcement learning is a powerful tool for training AI models that can help increase automation or optimize operational efficiency of sophisticated systems such as robotics, manufacturing, and supply chain logistics. However, moving from the games commonly used to demonstrate these techniques into real-world applications isn't always straightforward. Structuring solutions to move beyond purely data-driven training introduces all sorts of new complexity, requiring you to consider things like how to use simulations to target your learning objectives, what kinds of simulations are applicable, how to deal with long-running simulations, how to incorporate ongoing training refinement once deployed, how to account for scaling and performance, and ultimately how to bridge from simulation to the real world. I was recently able to talk about how to effectively leverage reinforcement learning in real-world use cases at the O'Reilly AI conference in San Francisco.
Analysis of Agent Expertise in Ms. Pac-Man using Value-of-Information-based Policies
Sledge, Isaac J., Principe, Jose C.
Conventional reinforcement learning methods for Markov decision processes rely on weakly-guided, stochastic searches to drive the learning process. It can therefore be difficult to predict what agent behaviors might emerge. In this paper, we consider an information-theoretic cost function for performing constrained stochastic searches that promote the formation of risk-averse to risk-favoring behaviors. This cost function is the value of information, which provides the optimal trade-off between the expected return of a policy and the policy's complexity; policy complexity is measured by number of bits and controlled by a single hyperparameter on the cost function. As the policy complexity is reduced, the agents will increasingly eschew risky actions. This reduces the potential for high accrued rewards. As the policy complexity increases, the agents will take actions, regardless of the risk, that can raise the long-term rewards. The obtainable reward depends on a single, tunable hyperparameter that regulates the degree of policy complexity. We evaluate the performance of value-of-information-based policies on a stochastic version of Ms. Pac-Man. A major component of this paper is the demonstration that ranges of policy complexity values yield different game-play styles and explaining why this occurs. We also show that our reinforcement-learning search mechanism is more efficient than the others we utilize. This result implies that the value of information theory is appropriate for framing the exploitation-exploration trade-off in reinforcement learning.