"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction, Section 1.1. MIT Press, Cambridge, MA, 1998.
Figure: (a) Goal-conditioned RL often fails to reach distant goals, but can successfully reach the goal if starting nearby (inside the green region).

Reinforcement learning (RL) has seen a lot of progress over the past few years, tackling increasingly complex tasks. Much of this progress has been enabled by combining existing RL algorithms with powerful function approximators (i.e., neural networks). Function approximators have enabled researchers to apply RL to tasks with high-dimensional inputs without hand-crafting representations, distance metrics, or low-level controllers. However, function approximators have not come for free, and their cost is reflected in notoriously challenging optimization: deep RL algorithms are famously unstable and sensitive to hyperparameters.
Whether it's a dog chasing after a ball, or a monkey swinging through the trees, animals can effortlessly perform an incredibly rich repertoire of agile locomotion skills. But designing controllers that enable legged robots to replicate these agile behaviors can be a very challenging task. The superior agility seen in animals, as compared to robots, might lead one to wonder: can we create more agile robotic controllers with less effort by directly imitating animals? In this work, we present a framework for learning robotic locomotion skills by imitating animals. Given a reference motion clip recorded from an animal (e.g. a dog), our framework uses reinforcement learning to train a control policy that enables a robot to imitate the motion in the real world.
Supply chain and price management were among the first areas of enterprise operations to adopt data science and combinatorial optimization, and they have a long history of using these techniques with great success. Although a wide range of traditional optimization methods is available for inventory and price management applications, deep reinforcement learning has the potential to substantially improve the optimization capabilities for these and other types of enterprise operations, thanks to impressive recent advances in generic self-learning algorithms for optimal control. In this article, we explore how deep reinforcement learning methods can be applied in several basic supply chain and price management scenarios.

The traditional price optimization process in retail or manufacturing environments is typically framed as a what-if analysis of different pricing scenarios using some sort of demand model. In many cases, developing the demand model is the hard part, because it has to properly capture a wide range of factors and variables that influence demand, including regular prices, discounts, marketing activities, seasonality, competitor prices, cross-product cannibalization, and halo effects. Once the demand model is developed, however, optimizing the pricing decisions is relatively straightforward, and standard techniques such as linear or integer programming typically suffice.

For instance, consider an apparel retailer that purchases a seasonal product at the beginning of the season and has to sell it out by the end of the season. Assuming that the retailer chooses pricing levels from a discrete set (e.g., \$59.90, \$69.90, etc.) and can change prices frequently (e.g., weekly), we can pose an optimization problem in which the first constraint ensures that each time interval has only one price, and the second constraint ensures that the demands sum up to the available stock level.
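That integer program can be sketched as follows. This is a hedged reconstruction, not the article's original formula: the indicator $x_{tj}$ (price level $j$ is active in interval $t$), the price levels $p_j$, the demand function $d(t, p_j)$, and the stock level $c$ are my notation.

```latex
\begin{aligned}
\max_{x} \quad & \sum_{t=1}^{T} \sum_{j=1}^{J} p_j \, d(t, p_j) \, x_{tj} \\
\text{s.t.} \quad & \sum_{j=1}^{J} x_{tj} = 1 \qquad \text{for all } t
  && \text{(one price per time interval)} \\
& \sum_{t=1}^{T} \sum_{j=1}^{J} d(t, p_j) \, x_{tj} = c
  && \text{(demands sum to the stock level)} \\
& x_{tj} \in \{0, 1\}
\end{aligned}
```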
DeepMind's breakthroughs in recent years are well documented, and the UK AI company has repeatedly stressed that mastering Go, StarCraft, etc. were not ends in themselves but rather steps toward artificial general intelligence (AGI). DeepMind's latest achievement stays on that path: Agent57 is the ultimate gamer, the first deep reinforcement learning (RL) agent to top human baseline scores on all games in the Atari57 test set.
As the Agent interacts with the Environment, it learns a policy. A policy is a "learned strategy" that governs the Agent's behaviour in selecting an action at a particular time step t. A policy can be seen as a mapping from states of an Environment to the actions taken in those states. The goal of the reinforcement learning Agent is to maximize its long-term rewards as it interacts with the Environment in this feedback configuration. The feedback signal the Agent receives after each state-action cycle (where the Agent selects an action from the set of actions available in each state of the Environment) is called the reward, and the mapping from state-action pairs to rewards is called the reward function.
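The mapping described above can be made concrete with a toy tabular policy and environment. Everything here (state names, actions, transition table, rewards) is an illustrative assumption, not from the article:

```python
# A policy as a state -> action mapping over a hypothetical three-state world.
policy = {
    "start": "move_right",
    "middle": "move_right",
    "near_goal": "stop",
}

# Hypothetical environment dynamics: each (state, action) pair yields the
# next state and a scalar reward signal.
transitions = {
    ("start", "move_right"): ("middle", 0.0),
    ("middle", "move_right"): ("near_goal", 0.0),
    ("near_goal", "stop"): ("terminal", 1.0),
}

def rollout(policy, start_state="start"):
    """Follow the policy one state-action cycle at a time, accumulating
    the reward the Agent receives from the Environment."""
    state, total_reward = start_state, 0.0
    while state != "terminal":
        action = policy[state]                    # policy: state -> action
        state, reward = transitions[(state, action)]
        total_reward += reward                    # feedback from the Environment
    return total_reward

print(rollout(policy))  # -> 1.0
```

Following the policy from "start" traverses the three states and collects the terminal reward, which is the long-term return the Agent tries to maximize.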
Good gamers can tune out distractions and unimportant on-screen information and focus their attention on avoiding obstacles and overtaking others in virtual racing games like Mario Kart. However, can machines behave similarly in such vision-based tasks? A possible solution is designing agents that encode and process abstract concepts, and research in this area has focused on learning all abstract information from visual inputs. This however is compute intensive and can even degrade model performance. Now, researchers from Google Brain Tokyo and Google Japan have proposed a novel approach that helps guide reinforcement learning (RL) agents to what's important in vision-based tasks.
A new reinforcement-learning algorithm has learned to optimize the placement of components on a computer chip, making the chip more efficient and less power-hungry. Chip floorplanning requires the careful configuration of hundreds, sometimes thousands, of components across multiple layers in a constrained area. Traditionally, engineers manually design configurations that minimize the amount of wire used between components as a proxy for efficiency. They then use electronic design automation software to simulate and verify the design's performance, which can take up to 30 hours for a single floor plan. Time lag: Because of the time investment put into each chip design, chips are traditionally expected to last between two and five years.
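The wire-use proxy mentioned above is conventionally measured as half-perimeter wirelength (HPWL): the half-perimeter of the bounding box around each net's pins. A minimal sketch, with illustrative coordinates:

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net: width + height of the
    bounding box around its pin coordinates (list of (x, y) tuples)."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_wirelength(nets):
    """Sum HPWL over all nets in a placement -- the quantity a placer
    (manual or learned) tries to minimize."""
    return sum(hpwl(net) for net in nets)

# Two hypothetical nets on a small grid.
nets = [
    [(0, 0), (3, 1)],           # bounding box 3 wide, 1 tall -> HPWL 4
    [(1, 2), (2, 5), (4, 2)],   # bounding box 3 wide, 3 tall -> HPWL 6
]
print(total_wirelength(nets))   # -> 10
```

Because HPWL is cheap to evaluate, it can serve as a fast reward signal during placement, with the slow simulation reserved for final verification.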
A preprint paper coauthored by Uber AI scientists and Jeff Clune, a research team leader at San Francisco startup OpenAI, describes Fiber, an AI development and distributed training platform for methods including reinforcement learning (which spurs AI agents to complete goals via rewards) and population-based learning. The team says that Fiber expands the accessibility of large-scale parallel computation without the need for specialized hardware or equipment, enabling non-experts to reap the benefits of genetic algorithms, in which populations of agents evolve rather than individual members. Fiber -- which was developed to power large-scale parallel scientific computation projects like POET -- is available in open source as of this week on GitHub. It supports Linux systems running Python 3.6 and up and Kubernetes running on public cloud environments like Google Cloud, and the research team says that it can scale to hundreds or even thousands of machines. As the researchers point out, increasing computation underlies many recent advances in machine learning, with more and more algorithms relying on distributed training to process enormous amounts of data.
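The population-based flavor of learning that Fiber targets can be illustrated without Fiber itself. Below is a toy generational loop in pure Python (this is a sketch of the general technique, not Fiber's API; the objective and hyperparameters are illustrative):

```python
import random

random.seed(0)

def fitness(genome):
    """Toy objective: maximize the sum of genome values. In practice this
    would be an agent's return on a task, evaluated in parallel."""
    return sum(genome)

def evolve(pop_size=20, genome_len=8, generations=30, mutation=0.1):
    """One simple generational loop: evaluate every member, keep the
    fittest half (elitism), refill the population with mutated copies."""
    pop = [[random.random() for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = [
            [g + random.gauss(0, mutation) for g in random.choice(survivors)]
            for _ in range(pop_size - len(survivors))
        ]
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(round(fitness(best), 2))
```

The per-member fitness evaluations are independent, which is exactly what makes this family of algorithms a good fit for the large-scale parallelism platforms like Fiber provide.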
This repository contains an implementation of a distributed reinforcement learning agent where both training and inference are performed on the learner. Any reinforcement learning environment using the gym API can be used. For a detailed description of the architecture, please read our paper, and please cite it if you use the code from this repository in your work. There are a few steps you need to take before playing with SEED.
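The gym API referred to above boils down to two calls: `reset()` returning an initial observation, and `step(action)` returning an (observation, reward, done, info) tuple. A minimal stand-in environment sketching that interface (illustrative only, not part of SEED):

```python
class CountdownEnv:
    """Toy environment with the gym-style interface: reward 1.0 when the
    counter reaches zero, 0.0 otherwise."""

    def reset(self):
        self.counter = 3
        return self.counter  # initial observation

    def step(self, action):
        self.counter -= 1
        done = self.counter == 0
        reward = 1.0 if done else 0.0
        return self.counter, reward, done, {}  # obs, reward, done, info

# Any environment exposing reset()/step() like this can be plugged in.
env = CountdownEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done, info = env.step(0)  # a fixed dummy action
    total += reward
print(total)  # -> 1.0
```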