AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

Neural Approximate Dynamic Programming for On-Demand Ride-Pooling

Shah, Sanket, Lowalekar, Meghna, Varakantham, Pradeep

arXiv.org Artificial IntelligenceNov-20-2019

On-demand ride-pooling (e.g., UberPool) has recently become popular because of its ability to lower costs for passengers while simultaneously increasing revenue for drivers and aggregation companies. Unlike in Taxi on Demand (ToD) services -- where a vehicle is only assigned one passenger at a time -- in on-demand ride-pooling, each (possibly partially filled) vehicle can be assigned a group of passenger requests with multiple different origin and destination pairs. To ensure near real-time response, existing solutions to the real-time ride-pooling problem are myopic in that they optimise the objective (e.g., maximise the number of passengers served) for the current time step without considering its effect on future assignments. This is because even a myopic assignment in ride-pooling involves considering what combinations of passenger requests that can be assigned to vehicles, which adds a layer of combinatorial complexity to the ToD problem. A popular approach that addresses the limitations of myopic assignments in ToD problems is Approximate Dynamic Programming (ADP). Existing ADP methods for ToD can only handle Linear Program (LP) based assignments, however, while the assignment problem in ride-pooling requires an Integer Linear Program (ILP) with bad LP relaxations. To this end, our key technical contribution is in providing a general ADP method that can learn from ILP-based assignments. Additionally, we handle the extra combinatorial complexity from combinations of passenger requests by using a Neural Network based approximate value function and show a connection to Deep Reinforcement Learning that allows us to learn this value-function with increased stability and sample-efficiency. We show that our approach outperforms past approaches on a real-world dataset by up to 16%, a significant improvement in city-scale transportation problems.

denote, value function, vehicle, (14 more...)

arXiv.org Artificial Intelligence

1911.08842

Country:

Asia > Singapore (0.04)
North America > United States > New York (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.40)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Hierarchical Average Reward Policy Gradient Algorithms

Dharmavaram, Akshay, Riemer, Matthew, Bhatnagar, Shalabh

arXiv.org Artificial IntelligenceNov-20-2019

Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long term credit assignment by leveraging temporal abstractions. However, when dealing with extended timescales, discounting future rewards can lead to incorrect credit assignments. In this work, we address this issue by extending the hierarchical option-critic policy gradient theorem for the average reward criterion. Our proposed framework aims to maximize the long-term reward obtained in the steady-state of the Markov chain defined by the agent's policy. Furthermore, we use an ordinary differential equation based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions, converge to their corresponding optimal values, with probability one. Finally, we illustrate the competitive advantage of learning options, in the average reward setting, on a grid-world environment with sparse rewards.

agent, equation, tesauro 2018, (14 more...)

arXiv.org Artificial Intelligence

1911.08826

Country:

North America > United States (0.04)
Asia > India > Karnataka > Bengaluru (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.35)

Add feedback

Solving Online Threat Screening Games using Constrained Action Space Reinforcement Learning

Shah, Sanket, Sinha, Arunesh, Varakantham, Pradeep, Perrault, Andrew, Tambe, Milind

arXiv.org Artificial IntelligenceNov-20-2019

Large-scale screening for potential threats with limited resources and capacity for screening is a problem of interest at airports, seaports, and other ports of entry. Adversaries can observe screening procedures and arrive at a time when there will be gaps in screening due to limited resource capacities. To capture this game between ports and adversaries, this problem has been previously represented as a Stackelberg game, referred to as a Threat Screening Game (TSG). Given the significant complexity associated with solving TSGs and uncertainty in arrivals of customers, existing work has assumed that screenees arrive and are allocated security resources at the beginning of the time window. In practice, screenees such as airport passengers arrive in bursts correlated with flight time and are not bound by fixed time windows. To address this, we propose an online threat screening model in which screening strategy is determined adaptively as a passenger arrives while satisfying a hard bound on acceptable risk of not screening a threat. To solve the online problem with a hard bound on risk, we formulate it as a Reinforcement Learning (RL) problem with constraints on the action space (hard bound on risk). We provide a novel way to efficiently enforce linear inequality constraints on the action output in Deep Reinforcement Learning. We show that our solution allows us to significantly reduce screenee wait time while guaranteeing a bound on risk.

average delay, constraint, passenger, (16 more...)

arXiv.org Artificial Intelligence

1911.08799

Country:

Asia > Singapore (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report (1.00)

Industry:

Transportation > Passenger (1.00)
Leisure & Entertainment > Games (0.89)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)

Add feedback

The Reinforcement-Learning Methods that Allow AlphaStar to Outcompete Almost All Human Players at StarCraft II - KDnuggets

#artificialintelligenceNov-19-2019, 18:31:31 GMT

In January, artificial intelligence(AI) powerhouse DeepMind announced it had achieved a major milestone in its journey towards building AI systems that resemble human cognition. AlphaStar was a DeepMind agent designed using reinforcement learning that was able to beat two professional players at a game of StarCraft II, one of the most complex real-time strategy games of all time. During the last few months, DeepMind continued evolving AlphaStar to the point that the AI agent is now able to play a full game of StarCraft II at a Grandmaster level outranking 99.8% of human players. The results were recently published in Nature and they show some of the most advanced self-learning techniques used in modern AI systems. DeepMind's milestone is better explained by illustrating the trajectory from the first version of AlphaStar to the current one as well as some of the key challenges of StarCraft II.

agent, alphastar, starcraft ii, (15 more...)

#artificialintelligence

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

DeepMind on Twitter

#artificialintelligenceNov-19-2019, 06:31:30 GMT

Our new @nature paper: AlphaStar is the first learning system to reach the top tier of a major esport without any game restrictions, achieving Grandmaster status in StarCraft II. Researchers have been working on the StarCraft series for over 15 years.

deepmind, twitter

#artificialintelligence

Industry: Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)

Add feedback

Bayesian Curiosity for Efficient Exploration in Reinforcement Learning

Blau, Tom, Ott, Lionel, Ramos, Fabio

arXiv.org Machine LearningNov-19-2019

Balancing exploration and exploitation is a fundamental part of reinforcement learning, yet most state-of-the-art algorithms use a naive exploration protocol like $\epsilon$-greedy. This contributes to the problem of high sample complexity, as the algorithm wastes effort by repeatedly visiting parts of the state space that have already been explored. We introduce a novel method based on Bayesian linear regression and latent space embedding to generate an intrinsic reward signal that encourages the learning agent to seek out unexplored parts of the state space. This method is computationally efficient, simple to implement, and can extend any state-of-the-art reinforcement learning algorithm. We evaluate the method on a range of algorithms and challenging control tasks, on both simulated and physical robots, demonstrating how the proposed method can significantly improve sample complexity.

artificial intelligence, reinforcement learning, upstream oil & gas, (19 more...)

arXiv.org Machine Learning

1911.08701

Country: Oceania > Australia (0.14)

Genre: Research Report (0.84)

Industry: Energy > Oil & Gas > Upstream (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.50)

Add feedback

Evaluating task-agnostic exploration for fixed-batch learning of arbitrary future tasks

Dasagi, Vibhavari, Lee, Robert, Bruce, Jake, Leitner, Jürgen

arXiv.org Machine LearningNov-19-2019

Deep reinforcement learning has been shown to solve challenging tasks where large amounts of training experience is available, usually obtained online while learning the task. Robotics is a significant potential application domain for many of these algorithms, but generating robot experience in the real world is expensive, especially when each task requires a lengthy online training procedure. Off-policy algorithms can in principle learn arbitrary tasks from a diverse enough fixed dataset. In this work, we evaluate popular exploration methods by generating robotics datasets for the purpose of learning to solve tasks completely offline without any further interaction in the real world. We present results on three popular continuous control tasks in simulation, as well as continuous control of a high-dimensional real robot arm. Code documenting all algorithms, experiments, and hyper-parameters is available at https://github.com/qutrobotlearning/batchlearning.

algorithm, exploration, learning, (15 more...)

arXiv.org Machine Learning

1911.08666

Country:

Oceania > Australia > Queensland > Brisbane (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Industry: Education > Educational Setting > Online (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Schrittwieser, Julian, Antonoglou, Ioannis, Hubert, Thomas, Simonyan, Karen, Sifre, Laurent, Schmitt, Simon, Guez, Arthur, Lockhart, Edward, Hassabis, Demis, Graepel, Thore, Lillicrap, Timothy, Silver, David

arXiv.org Machine LearningNov-19-2019

Planning algorithms based on lookahead search have achieved remarkable successes in artificial intelligence. Human world champions have been defeated in classic games such as checkers [34], chess [5], Go [38] and poker [3, 26], and planning algorithms have had real-world impact in applications from logistics [47] to chemical synthesis [37]. However, these planning algorithms all rely on knowledge of the environment's dynamics, such as the rules of the game or an accurate simulator, preventing their direct application to real-world domains like robotics, industrial control, or intelligent assistants. Model-based reinforcement learning (RL) [42] aims to address this issue by first learning a model of the environment's dynamics, and then planning with respect to the learned model. Typically, these models have either focused on reconstructing the true environmental state [8, 16, 24], or the sequence of full observations [14, 20]. However, prior work [4, 14, 20] remains far from the state of the art in visually rich domains, such as Atari 2600 games [2]. Instead, the most successful methods are based on model-free RL [9, 21, 18] - i.e. they estimate the optimal policy and/or value function directly from interactions with the environment. However, model-free algorithms are in turn far from the state of the art in domains that require precise and sophisticated lookahead, such as chess and Go. In this paper, we introduce MuZero, a new approach to model-based RL that achieves state-of-the-art performance in Atari 2600, a visually complex set of domains, while maintaining superhuman performance in precision planning tasks such as chess, shogi and Go.

algorithm, muzero, simulation, (15 more...)

arXiv.org Machine Learning

1911.08265

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(2 more...)

Genre:

Research Report (1.00)
Workflow (0.93)

Industry:

Leisure & Entertainment > Games > Chess (1.00)
Leisure & Entertainment > Games > Computer Games (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Corruption Robust Exploration in Episodic Reinforcement Learning

Lykouris, Thodoris, Simchowitz, Max, Slivkins, Aleksandrs, Sun, Wen

arXiv.org Artificial IntelligenceNov-19-2019

We initiate the study of multi-stage episodic reinforcement learning under adversarial manipulations in both the rewards and the transition probabilities of the underlying system. Existing efficient algorithms heavily rely on the "optimism under uncertainty" principle which dictates their behavior and does not allow flexibility to perform corruption-robust exploration. We address this by (i) departing from the optimistic behavior, and (ii) creating a general framework that incorporates the principle of action-elimination. (This principle has been essential for corruption-robust exploration in multi-armed bandits, a degenerate special case of episodic reinforcement learning.) Despite constructing a lower bound for a straightforward implementation of action-elimination, we provide a clean and modular way to transfer it to episodic reinforcement learning. Our algorithm enjoys near-optimal guarantees in the absence of adversarial manipulations, has performance that degrades gracefully as the amount of corruption increases, and does not need to know this amount. Our results shed new light on the broader question of robust exploration, and suggest a way to address a rather daunting mismatch between optimistic algorithms and algorithms with higher flexibility. To demonstrate the applicability of our framework, we provide a second instantiation thereof, showing how it can provide efficient guarantees for the stochastic setting, despite doing almost uniform exploration across plausibly optimal actions.

algorithm, corruption, probability, (15 more...)

arXiv.org Artificial Intelligence

1911.08689

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.70)

Industry:

Leisure & Entertainment (1.00)
Media > Television (0.45)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Efficient decorrelation of features using Gramian in Reinforcement Learning

Mavrin, Borislav, Graves, Daniel, Chan, Alan

arXiv.org Artificial IntelligenceNov-19-2019

Learning good representations is a long standing problem in reinforcement learning (RL). One of the conventional ways to achieve this goal in the supervised setting is through regularization of the parameters. Extending some of these ideas to the RL setting has not yielded similar improvements in learning. In this paper, we develop an online regularization framework for decorrelating features in RL and demonstrate its utility in several test environments. We prove that the proposed algorithm converges in the linear function approximation setting and does not change the main objective of maximizing cumulative reward. We demonstrate how to scale the approach to deep RL using the Gramian of the features achieving linear computational complexity in the number of features and squared complexity in size of the batch. We conduct an extensive empirical study of the new approach on Atari 2600 games and show a significant improvement in sample efficiency in 40 out of 49 games.

equation, matrix, representation, (15 more...)

arXiv.org Artificial Intelligence

1911.0861

Country:

North America > Canada > Alberta (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)

Add feedback