Reinforcement Learning
Winning Blackjack using Machine Learning – Towards Data Science
One of the great things about machine learning is that there are so many different approaches to solving problems. Neural networks are great for finding patterns in data, resulting in predictive capabilities that are truly impressive. Reinforcement learning uses rewards-based concepts, improving over time. And then there's the approach called a genetic algorithm. A genetic algorithm (GA) uses principles from evolution to solve problems.
Learning preferences by looking at the world
It would be great if we could all have household robots do our chores for us. Chores are tasks that we want done to make our houses cater more to our preferences; they are a way in which we want our house to be different from the way it currently is. However, most "different" states are not very desirable: Surely our robot wouldn't be so dumb as to go around breaking stuff when we ask it to clean our house? Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might've inadvertently left out. Generally, it is easy to get the reward wrong by forgetting to include preferences for things that should stay the same, since we are so used to having these preferences satisfied, and there are so many of them.
AI guides single-camera drone through hallways it's never seen before
Researchers at the University of Colorado recently demonstrated a system that helps robots figure out the direction of hiking trails from camera footage, and scientists at ETH Zurich described in a January paper a machine learning framework that aids four-legged robots in getting up from the ground when they trip and fall. But might such AI perform just as proficiently when applied to a drone rather than machines planted firmly on the ground? A team at the University of California at Berkeley set out to find out. In a newly published paper on the preprint server arXiv ("Generalization through Simulation: Integrating Simulated and Real Data into Deep Reinforcement Learning for Vision-Based Autonomous Flight"), the team proposes a "hybrid" deep reinforcement learning algorithm that combines data from both a digital simulation and the real world to guide a quadcopter through carpeted corridors. "In this work, we … aim to devise a transfer learning algorithm where the physical behavior of the vehicle is learned," the paper's authors wrote. "In essence, real-world experience is used to learn how to fly, while simulated experience is used to learn how to generalize."
Off-Policy Actor-Critic in an Ensemble: Achieving Maximum General Entropy and Effective Environment Exploration in Deep Reinforcement Learning
We propose a new policy iteration theory as an important extension of soft policy iteration and Soft Actor-Critic (SAC), one of the most efficient model free algorithms for deep reinforcement learning. Supported by the new theory, arbitrary entropy measures that generalize Shannon entropy, such as Tsallis entropy and Renyi entropy, can be utilized to properly randomize action selection while fulfilling the goal of maximizing expected long-term rewards. Our theory gives birth to two new algorithms, i.e., Tsallis entropy Actor-Critic (TAC) and Renyi entropy Actor-Critic (RAC). Theoretical analysis shows that these algorithms can be more effective than SAC. Moreover, they pave the way for us to develop a new Ensemble Actor-Critic (EAC) algorithm in this paper that features the use of a bootstrap mechanism for deep environment exploration as well as a new value-function based mechanism for high-level action selection. Empirically we show that TAC, RAC and EAC can achieve state-of-the-art performance on a range of benchmark control tasks, outperforming SAC and several cutting-edge learning algorithms in terms of both sample efficiency and effectiveness.
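To make the entropy measures named in this abstract concrete, here is a minimal sketch that computes Shannon, Tsallis, and Renyi entropy for a discrete action distribution. It uses the textbook definitions; the function names and the parameters `q` and `alpha` are assumptions for illustration, and the paper's own parameterization may differ.

```python
import math


def shannon_entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    return -sum(x * math.log(x) for x in p if x > 0)


def tsallis_entropy(p, q):
    """Textbook Tsallis entropy: (1 - sum_i p_i^q) / (q - 1).

    Falls back to Shannon entropy at q == 1, which is its limit.
    """
    if abs(q - 1.0) < 1e-12:
        return shannon_entropy(p)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)


def renyi_entropy(p, alpha):
    """Textbook Renyi entropy: log(sum_i p_i^alpha) / (1 - alpha).

    Falls back to Shannon entropy at alpha == 1, which is its limit.
    """
    if abs(alpha - 1.0) < 1e-12:
        return shannon_entropy(p)
    return math.log(sum(x ** alpha for x in p if x > 0)) / (1.0 - alpha)


# Hypothetical action distributions, used only for illustration.
uniform_policy = [0.25, 0.25, 0.25, 0.25]
peaked_policy = [0.7, 0.2, 0.1]
```

Both generalized measures recover Shannon entropy as their parameter approaches 1, which is the sense in which they generalize it; away from 1 they weight low-probability actions differently.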
On Reinforcement Learning Using Monte Carlo Tree Search with Supervised Learning: Non-Asymptotic Analysis
Shah, Devavrat, Xie, Qiaomin, Xu, Zhi
Inspired by the success of AlphaGo Zero (AGZ) which utilizes Monte Carlo Tree Search (MCTS) with Supervised Learning via Neural Network to learn the optimal policy and value function, in this work, we focus on establishing formally that such an approach indeed finds optimal policy asymptotically, as well as establishing non-asymptotic guarantees in the process. We shall focus on infinite-horizon discounted Markov Decision Process to establish the results. To start with, it requires establishing the MCTS's claimed property in the literature that for any given query state, MCTS provides approximate value function for the state with enough simulation steps of MDP. We provide non-asymptotic analysis establishing this property by analyzing a non-stationary multi-arm bandit setup. Our proof suggests that MCTS needs to be utilized with polynomial rather than logarithmic "upper confidence bound" for establishing its desired performance -- interestingly enough, AGZ chooses such polynomial bound. Using this as a building block, combined with nearest neighbor supervised learning, we argue that MCTS acts as a "policy improvement" operator; it has a natural "bootstrapping" property to iteratively improve value function approximation for all states, due to combining with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn $\varepsilon$ approximation of value function in $\ell_\infty$ norm, MCTS combined with nearest-neighbors requires samples scaling as $\widetilde{O}\big(\varepsilon^{-(d+4)}\big)$, where $d$ is the dimension of the state space. This is nearly optimal due to a minimax lower bound of $\widetilde{\Omega}\big(\varepsilon^{-(d+2)}\big).$
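The contrast between a logarithmic and a polynomial "upper confidence bound" can be sketched as follows. This is an illustration only: the constants and exponents (`c`, `beta`, `num_exp`, `den_exp`) and the helper `select_arm` are assumptions chosen for readability, not the values or the procedure analyzed in the paper.

```python
import math


def logarithmic_bonus(t, n, c=2.0):
    """Classical UCB1-style bonus: sqrt(c * ln(t) / n).

    t is the total number of pulls so far, n the pulls of this arm.
    """
    return math.sqrt(c * math.log(t) / n)


def polynomial_bonus(t, n, beta=1.0, num_exp=0.25, den_exp=0.5):
    """Polynomial bonus: beta * t^num_exp / n^den_exp.

    The exponents are illustrative assumptions.
    """
    return beta * t ** num_exp / n ** den_exp


def select_arm(means, counts, t, bonus):
    """Pick the arm with the largest (empirical mean + exploration bonus).

    Arms that have never been pulled are tried first.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(range(len(means)), key=lambda a: means[a] + bonus(t, counts[a]))
```

With a polynomial bonus the exploration term grows much faster in `t` than the logarithmic one, so rarely visited arms are revisited more aggressively.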
Sample-Optimal Parametric Q-Learning with Linear Transition Models
Consider a Markov decision process (MDP) that admits a set of state-action features, which can linearly express the process's probabilistic transition model. We propose a parametric Q-learning algorithm that finds an approximate-optimal policy using a sample size proportional to the feature dimension $K$ and invariant with respect to the size of the state space. To further improve its sample efficiency, we exploit the monotonicity property and intrinsic noise structure of the Bellman operator, provided the existence of anchor state-actions that imply implicit non-negativity in the feature space. We augment the algorithm using techniques of variance reduction, monotonicity preservation, and confidence bounds. It is proved to find a policy which is $\epsilon$-optimal from any initial state with high probability using $\widetilde{O}(K/\epsilon^2(1-\gamma)^3)$ sample transitions for arbitrarily large-scale MDP with a discount factor $\gamma\in(0,1)$. A matching information-theoretical lower bound is proved, confirming the sample optimality of the proposed method with respect to all parameters (up to polylog factors).
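The structural idea in this abstract can be sketched in a few lines: if the transition model factors as P(s' | s, a) = sum_k phi_k(s, a) * psi_k(s'), then Q(s, a) = r(s, a) + gamma * phi(s, a) . w for a vector w of only K numbers, whatever the number of states. The sketch below is plain value iteration in that K-dimensional space with a known model, not the paper's sample-based algorithm with variance reduction and confidence bounds; all names and the toy MDP are hypothetical.

```python
def q_value(r, phi, w, gamma, s, a):
    """Q(s, a) = r(s, a) + gamma * <phi(s, a), w>."""
    return r[s][a] + gamma * sum(f * wk for f, wk in zip(phi[s][a], w))


def greedy_values(r, phi, w, gamma):
    """V(s) = max_a Q(s, a) under the current parameter vector w."""
    return [max(q_value(r, phi, w, gamma, s, a) for a in range(len(r[s])))
            for s in range(len(r))]


def parametric_value_iteration(r, phi, psi, gamma, iters=500):
    """Iterate w_k <- sum_{s'} psi_k(s') * V(s'); w has K entries."""
    w = [0.0] * len(psi)
    for _ in range(iters):
        v = greedy_values(r, phi, w, gamma)
        w = [sum(psi_k[s2] * v[s2] for s2 in range(len(v))) for psi_k in psi]
    return w


def transition_matrix(phi, psi):
    """Expand the factored model into the full P[s][a][s'] (for checking)."""
    n = len(psi[0])
    return [[[sum(f * psi_k[s2] for f, psi_k in zip(phi_sa, psi))
              for s2 in range(n)] for phi_sa in phi_s] for phi_s in phi]


def tabular_value_iteration(r, P, gamma, iters=500):
    """Standard value iteration on the full transition matrix."""
    v = [0.0] * len(r)
    for _ in range(iters):
        v = [max(r[s][a] + gamma * sum(p * x for p, x in zip(P[s][a], v))
                 for a in range(len(r[s]))) for s in range(len(r))]
    return v


# Hypothetical toy MDP: 3 states, 2 actions, K = 2 features.
GAMMA = 0.9
R = [[0.0, 1.0], [0.5, 0.0], [1.0, 0.2]]
PHI = [[[1.0, 0.0], [0.2, 0.8]],
       [[0.5, 0.5], [0.0, 1.0]],
       [[0.9, 0.1], [0.3, 0.7]]]
PSI = [[1.0, 0.0, 0.0],
       [0.0, 0.5, 0.5]]

w = parametric_value_iteration(R, PHI, PSI, GAMMA)
v_parametric = greedy_values(R, PHI, w, GAMMA)
v_tabular = tabular_value_iteration(R, transition_matrix(PHI, PSI), GAMMA)
```

In this toy case the parametric and tabular value functions agree up to floating-point error, while the parametric version only ever stores the K-dimensional vector `w`.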
ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero
Tian, Yuandong, Ma, Jerry, Gong, Qucheng, Sengupta, Shubho, Chen, Zhuoyuan, Pinkerton, James, Zitnick, C. Lawrence
The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are a remarkable demonstration of deep reinforcement learning's capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding of and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training and in the gameplay inference procedures. Our code, models, selfplay datasets, and auxiliary data are publicly available.
Sample Complexity of Estimating the Policy Gradient for Nearly Deterministic Dynamical Systems
Reinforcement learning is a promising approach to learning robot controllers. It has recently been shown that algorithms based on finite-difference estimates of the policy gradient are competitive with algorithms based on the policy gradient theorem. We propose a theoretical framework for understanding this phenomenon. Our key insight is that many dynamical systems (especially those of interest in robot control tasks) are nearly deterministic, i.e., they can be modeled as a deterministic system with a small stochastic perturbation. We show that for such systems, finite-difference estimates of the policy gradient can have substantially lower variance than estimates based on the policy gradient theorem. We interpret these results in the context of counterfactual estimation. Finally, we empirically evaluate our insights in an experiment on the inverted pendulum.
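The variance claim can be illustrated on a toy problem. The sketch below is not the paper's experiment: it assumes a hypothetical one-step system with return -(a - 1)^2 plus small Gaussian noise, and compares a score-function (policy gradient theorem) estimator under a Gaussian policy with a central finite-difference estimator under a deterministic policy. All names, step sizes, and noise levels are assumptions.

```python
import random
import statistics


def rollout_return(action, rng, noise_std):
    """One-step, nearly deterministic system: quadratic return plus small noise."""
    return -(action - 1.0) ** 2 + rng.gauss(0.0, noise_std)


def score_function_estimate(theta, rng, policy_std, noise_std):
    """Single-sample estimate: return * d/dtheta log N(a; theta, policy_std^2)."""
    action = theta + policy_std * rng.gauss(0.0, 1.0)
    ret = rollout_return(action, rng, noise_std)
    return ret * (action - theta) / policy_std ** 2


def finite_difference_estimate(theta, rng, delta, noise_std):
    """Central difference of the return of the deterministic policy a = theta."""
    up = rollout_return(theta + delta, rng, noise_std)
    down = rollout_return(theta - delta, rng, noise_std)
    return (up - down) / (2.0 * delta)


def compare_estimators(theta=0.0, n=2000, seed=0,
                       policy_std=0.5, delta=0.1, noise_std=0.01):
    """Return (mean, variance) for each estimator over n independent samples."""
    rng = random.Random(seed)
    pg = [score_function_estimate(theta, rng, policy_std, noise_std)
          for _ in range(n)]
    fd = [finite_difference_estimate(theta, rng, delta, noise_std)
          for _ in range(n)]
    return ((statistics.mean(pg), statistics.variance(pg)),
            (statistics.mean(fd), statistics.variance(fd)))


(pg_mean, pg_var), (fd_mean, fd_var) = compare_estimators()
```

For this quadratic return both estimators target the same gradient, -2(theta - 1), but the finite-difference estimate only inherits the small system noise, whereas the score-function estimate also carries the variance of the sampled action.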
Understanding The Impact of Partner Choice on Cooperation and Social Norms by means of Multi-agent Reinforcement Learning
Anastassacos, Nicolas, Hailes, Steve, Musolesi, Mirco
The human ability to coordinate and cooperate has been vital to the development of societies for thousands of years. While it is not fully clear how this behavior arises, social norms are thought to be a key factor in this development. In contrast to laws set by authorities, norms tend to evolve in a bottom-up manner from interactions between members of a society. While much behavior can be explained through the use of social norms, it is difficult to measure the extent to which they shape society as well as how they are affected by other societal dynamics. In this paper, we discuss the design and evaluation of a reinforcement learning model for understanding how the opportunity to choose who you interact with in a society affects the overall societal outcome and the strength of social norms. We first study the emergence of norms and then the emergence of cooperation in the presence of norms. In our model, agents interact with other agents in a society in the form of repeated matrix-games: coordination games and cooperation games. In particular, in our model, at each stage, agents are either able to choose a partner to interact with or are forced to interact at random and learn using policy gradients.
Global Big Data Conference
The capability of a machine to "learn" on its own is the subject of some debate. With traditional supervised machine learning, decisions can be optimized, but the machine isn't really learning by itself. Now a startup called Cogitai is hoping to push the limits of a machine's capability to learn continuously using reinforcement learning techniques. Cogitai was founded in 2015 by some of the earliest innovators in the reinforcement learning (RL) field, including Mark Ring, Peter Stone, and Pete Wurman. The Orange County, California company is hoping to leverage the collective RL knowledge of its founders and the 15 or so PhD computer scientists in the firm to change the course of AI applications.