The Rubik's Cube is a famous 3-D puzzle toy. A regular Rubik's Cube has six faces, each covered with nine coloured stickers, and the puzzle is solved when every face shows a single colour. Counting each quarter (90°) turn as one move and a half turn (two quarter turns) as two moves, the best human-devised algorithms can solve any instance of the cube in at most 26 moves. My goal is to have the computer learn how to solve the Rubik's Cube without being fed any human knowledge, such as the symmetry of the cube. The most challenging part is that the Rubik's Cube has 43,252,003,274,489,856,000 possible permutations.
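That figure is easy to verify: the state count factorises into corner arrangements, corner twists, edge arrangements, and edge flips. The sketch below uses standard cube combinatorics and is not specific to any solver.

```python
# Deriving the often-quoted Rubik's Cube state count from first principles.
from math import factorial

corner_perms = factorial(8)      # 8 corner pieces can be arranged 8! ways
corner_twists = 3 ** 7           # 7 corners twist freely; the 8th is determined
edge_perms = factorial(12) // 2  # 12 edges, but parity ties them to the corners
edge_flips = 2 ** 11             # 11 edges flip freely; the 12th is determined

total = corner_perms * corner_twists * edge_perms * edge_flips
print(total)  # 43252003274489856000
```

The three divisions (one twist, one flip, and the permutation parity) are exactly what makes the count smaller than the naive 8!·3⁸·12!·2¹².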
In recent years, we've seen a resurgence in artificial intelligence (AI) and machine learning. Machine learning has led to some amazing results, like being able to analyze medical images and predict diseases on par with human experts. Google's AlphaGo program was able to beat a world champion at the strategy game Go using deep reinforcement learning. Machine learning is even being used to program self-driving cars, which is going to change the automotive industry forever. Imagine a world with drastically fewer car accidents, achieved simply by removing the element of human error.
My objective with this article is to demystify a few foundational Reinforcement Learning (RL) concepts with hands-on examples. We are going to apply RL to the infamous Glass Bridge challenge from episode 7 of the Netflix series Squid Game. No previous RL knowledge is required, but solid Python coding skills and a basic understanding of machine learning are necessary to follow the content of this article. The code can be found here. In simple terms, RL is a computational approach to achieving a pre-defined goal, which can be winning a chess game, optimizing a medical treatment, or improving a financial trading strategy.
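As a preview, the Glass Bridge can be modelled as a minimal episodic environment. The sketch below is plain Python; the class name, the 18-step length, and the reward values are assumptions made for this sketch, and the article's actual code may differ. Each step of the bridge has one tempered (safe) panel and one ordinary panel that shatters.

```python
# A minimal Glass Bridge environment sketch (names and rewards are invented
# for illustration). Actions: 0 = left panel, 1 = right panel.
import random

class GlassBridge:
    def __init__(self, n_steps=18, seed=None):
        self.n_steps = n_steps
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Randomly decide which panel is tempered at each step.
        self.safe = [self.rng.randint(0, 1) for _ in range(self.n_steps)]
        self.pos = 0
        return self.pos

    def step(self, action):
        """Return (state, reward, done)."""
        if action != self.safe[self.pos]:
            return self.pos, -1.0, True   # glass shatters, episode over
        self.pos += 1
        if self.pos == self.n_steps:
            return self.pos, 1.0, True    # crossed the whole bridge
        return self.pos, 0.0, False       # safe step, keep going

# A random agent rarely survives all 18 steps, which is the point of the game.
env = GlassBridge(seed=0)
state, done, total_reward = env.reset(), False, 0.0
while not done:
    state, reward, done = env.step(random.randint(0, 1))
    total_reward += reward
```

An agent that has memorised the safe panels earns +1 every episode; a random one succeeds with probability (1/2)¹⁸, which is what makes the task a nice RL exercise.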
References to artificial intelligence (AI) beings have appeared throughout history since antiquity. Indeed, it was the study of formal reasoning, begun by the philosophers and mathematicians of that era, that started this inquiry. Much later, the study of mathematical logic led computer scientist Alan Turing to develop his theory of computation. Turing is perhaps most widely known for his role at Bletchley Park in developing the Bombe, the electromechanical machine that decrypted the Nazi Enigma messages during World War II. However, it is perhaps the Church-Turing thesis, developed with Alonzo Church, which suggested that digital computers could simulate any process of formal reasoning, that is most influential in the field of AI today. Such work generated great initial excitement, and a workshop held at Dartmouth College in the summer of 1956, attended by many of the most influential computer science academics of the time, such as Marvin Minsky, John McCarthy, Herbert Simon, and Claude Shannon, led to the founding of artificial intelligence as a field.
The combination of deep learning and sequential decision-making has led to several impressive stories in decision-making AI research, including AIs that can play a variety of games (Atari video games, board games, the complex real-time strategy game StarCraft II), control robots (in simulation and in the real world), and even fly a weather balloon. These are examples of sequential decision tasks, in which the AI agent needs to make a sequence of decisions to achieve its goal. Today, the two main approaches for training such agents are reinforcement learning (RL) and imitation learning (IL). In reinforcement learning, humans provide rewards for completing discrete tasks, and these rewards are typically delayed and sparse; for example, 100 points are awarded for solving the first room of Montezuma's Revenge (Fig. 1). In the imitation learning setting, humans instead transfer knowledge and skills through step-by-step action demonstrations (Fig. 2), and the agent learns to mimic those actions.
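The difference between the two training signals can be made concrete with a deliberately tiny sketch. Everything below (the one-state task, the reward values, and the update rules) is invented for illustration and is not taken from the systems cited above: the RL learner must infer the good action from a scalar reward alone, while the IL learner can read it off the demonstrations directly.

```python
# Toy contrast between reinforcement learning and imitation learning on a
# one-state task where action 1 is correct (all values invented for the sketch).
import random

random.seed(0)

# --- Reinforcement learning: only a scalar reward is observed ---
q = [0.0, 0.0]                       # action-value estimates
for _ in range(500):
    a = random.randrange(2)          # explore both actions uniformly
    reward = 1.0 if a == 1 else 0.0  # the environment's opaque feedback
    q[a] += 0.1 * (reward - q[a])    # incremental value update

# --- Imitation learning: the expert's actions are observed directly ---
demos = [1] * 20                     # the expert demonstrates action 1
counts = [0, 0]
for a in demos:
    counts[a] += 1
il_policy = counts.index(max(counts))   # mimic the most-demonstrated action

print(q, il_policy)
```

Both learners end up preferring action 1, but the RL learner needed hundreds of exploratory trials to discover what twenty demonstrations told the IL learner outright, which is the sparse-reward versus demonstration trade-off in miniature.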
Deep reinforcement learning is one of the most interesting branches of artificial intelligence. It is behind some of the most remarkable achievements of the AI community, including beating human champions at board and video games, self-driving cars, robotics, and AI hardware design. Deep reinforcement learning leverages the learning capacity of deep neural networks to tackle problems that were too complex for classic RL techniques. It is also much more complicated than the other branches of machine learning. But in this post, I'll try to demystify it without going into the technical details.
In this in-depth tutorial, Richard S. Sutton, a research scientist at DeepMind and a computing science professor at the University of Alberta, explains the underlying formal problem, Markov decision processes, and the core solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning.
This is a must-read for any practitioner of RL. The book is divided into three parts, and I would strongly recommend reading through Parts I and II. The sections marked with (*) can be skipped on a first reading. And if you click on this, you will find links to Python and MATLAB implementations of the examples and exercises contained in the book.
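As a taste of the material, temporal-difference learning fits in a few lines. The sketch below runs TD(0) value estimation on a small random walk in the spirit of the book's random-walk example; the state count, step size, and episode count are choices made for this sketch, not values taken from the text. The true values of states 1..5 are 1/6 through 5/6.

```python
# TD(0) value estimation on a 5-state random walk. Reward +1 only when the
# walk exits on the right; both ends are terminal with value 0.
import random

random.seed(0)
N, ALPHA, GAMMA = 5, 0.1, 1.0
V = [0.0] * (N + 2)               # V[0] and V[N+1] are terminal, fixed at 0

for _ in range(5000):
    s = (N + 1) // 2              # start every episode in the middle state
    while s not in (0, N + 1):
        s2 = s + random.choice((-1, 1))   # unbiased random step
        r = 1.0 if s2 == N + 1 else 0.0   # reward only at the right end
        target = r if s2 in (0, N + 1) else r + GAMMA * V[s2]
        V[s] += ALPHA * (target - V[s])   # the TD(0) update
        s = s2

print([round(v, 2) for v in V[1:-1]])  # each entry drifts toward i/6
```

With a constant step size the estimates keep fluctuating around the true values, which is exactly the bias/variance behaviour the book analyses when comparing TD with Monte Carlo methods.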
This is a short guide on how to train an AI to play an arbitrary video game using reinforcement learning. It shows step by step how to set up a custom game environment and train the AI using the Stable-Baselines3 library. I wanted to make this guide accessible, so the presented code is not fully optimized. You can find the source on my GitHub. Unlike its supervised and unsupervised counterparts, Reinforcement Learning (RL) is not about an algorithm learning some underlying truth from a static dataset; instead, the algorithm interacts with its environment to maximize a reward function (much as animals are trained in real life with treats).
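The interact-to-maximize-reward loop that Stable-Baselines3 automates can be sketched in plain Python with tabular Q-learning on a toy game. The 1-D track below and all hyperparameters are invented for this sketch and use no SB3 code; SB3 itself trains neural-network policies, but the loop structure is the same.

```python
# Tabular Q-learning on a tiny invented "game": a 1-D track where the agent
# must walk from cell 0 to cell 9 (actions: 0 = left, 1 = right).
import random

random.seed(0)
N, EPS, ALPHA, GAMMA = 10, 0.1, 0.5, 0.95
Q = [[0.0, 0.0] for _ in range(N)]           # Q[state][action]

def step(s, a):
    """One move in the toy game; returns (next_state, reward, done)."""
    s2 = max(0, min(N - 1, s + (1 if a == 1 else -1)))
    done = s2 == N - 1
    return s2, (1.0 if done else -0.01), done   # small step cost, goal bonus

for _ in range(200):                          # training episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = random.randrange(2) if random.random() < EPS else Q[s].index(max(Q[s]))
        s2, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[s2])
        Q[s][a] += ALPHA * (r + GAMMA * best_next - Q[s][a])   # Q-learning update
        s = s2

greedy = [Q[s].index(max(Q[s])) for s in range(N - 1)]
print(greedy)   # the learned greedy policy should point right at every cell
```

The small negative step cost is a common trick to discourage dawdling; swapping the toy table for a neural network and the 1-D track for a game's pixel observations is, conceptually, what the library-based setup in this guide does.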
The Policy Space Response Oracle method (PSRO) provides a general solution for finding Nash equilibria in two-player zero-sum games, but it suffers from two problems: (1) computational inefficiency, because current populations are repeatedly evaluated by simulation; and (2) exploration inefficiency, because best responses are learned against a fixed meta-strategy at each iteration. In this work, we propose Efficient PSRO (EPSRO), which largely improves the efficiency of both steps. Central to our development is the newly introduced subroutine of minimax optimization on unrestricted-restricted (URR) games. By solving a URR game at each step, one can evaluate the current game and compute the best response in a single forward pass, with no need for game simulations. Theoretically, we prove that the solution procedures of EPSRO offer a monotonic improvement in exploitability. Moreover, a desirable property of EPSRO is that it is parallelizable, which allows for efficient exploration of the policy space and induces behavioral diversity. We test EPSRO on three classes of games and report a 50x speedup in wall-clock time, 10x data efficiency, and exploitability similar to existing PSRO methods on Kuhn and Leduc poker.