AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

Learning Invariant Representations for Reinforcement Learning without Reconstruction

Zhang, Amy, McAllister, Rowan, Calandra, Roberto, Gal, Yarin, Levine, Sergey

arXiv.org Artificial Intelligence, Jun-18-2020

We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying on either domain knowledge or pixel reconstruction. Our goal is to learn representations that both support effective downstream control and are invariant to task-irrelevant details. Bisimulation metrics quantify behavioral similarity between states in continuous MDPs, and we propose using them to learn robust latent representations that encode only the task-relevant information from observations. Our method trains encoders such that distances in latent space equal bisimulation distances in state space. We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks, where the background is replaced with moving distractors and natural videos, while achieving state-of-the-art performance. We also test a first-person highway driving task where our method learns invariance to clouds, weather, and time of day. Finally, we provide generalization results drawn from properties of bisimulation metrics, and links to causal inference.
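The training objective described in this abstract, matching latent distances to bisimulation distances, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the L1 latent distance, the diagonal-Gaussian model of next-state dynamics, the closed-form 2-Wasserstein term, and the function names `bisimulation_target` and `bisimulation_loss` are all assumptions made for the example.

```python
import math

def bisimulation_target(r_i, r_j, mu_i, sigma_i, mu_j, sigma_j, gamma=0.99):
    """Bisimulation-style distance between two states (illustrative).

    Assumes each state's next-latent distribution is a diagonal Gaussian
    (mean mu, std sigma), so the 2-Wasserstein distance has a closed form.
    """
    reward_gap = abs(r_i - r_j)
    w2 = math.sqrt(
        sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
        + sum((a - b) ** 2 for a, b in zip(sigma_i, sigma_j))
    )
    return reward_gap + gamma * w2

def bisimulation_loss(z_i, z_j, target):
    """Squared gap between the L1 latent distance and the target distance."""
    latent_dist = sum(abs(a - b) for a, b in zip(z_i, z_j))
    return (latent_dist - target) ** 2
```

An encoder trained to drive this loss toward zero over sampled state pairs places states with similar rewards and similar predicted dynamics close together, regardless of how different their pixels look.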

artificial intelligence, machine learning, reinforcement learning

2006.10742

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (1.00)

Industry:

Transportation > Ground > Road (0.48)
Automobiles & Trucks (0.48)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Competitive Policy Optimization

Prajapat, Manish, Azizzadenesheli, Kamyar, Liniger, Alexander, Yue, Yisong, Anandkumar, Anima

arXiv.org Machine Learning, Jun-18-2020

A core challenge in policy optimization in competitive Markov decision processes is the design of efficient optimization methods with desirable convergence and stability properties. To tackle this, we propose competitive policy optimization (CoPO), a novel policy gradient approach that exploits the game-theoretic nature of competitive games to derive policy updates. Motivated by the competitive gradient optimization method, we derive a bilinear approximation of the game objective. In contrast, off-the-shelf policy gradient methods utilize only linear approximations, and hence do not capture interactions among the players. We instantiate CoPO in two ways: (i) competitive policy gradient, and (ii) trust-region competitive policy optimization. We theoretically study these methods, and empirically investigate their behavior on a set of comprehensive, yet challenging, competitive games. We observe that they provide stable optimization, convergence to sophisticated strategies, and higher scores when played against baseline policy gradient methods.
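The contrast between linear and bilinear approximations can be seen on a toy game. The sketch below is not CoPO itself. It applies plain simultaneous gradient descent-ascent and a competitive-gradient-style update to the scalar zero-sum game f(x, y) = x * y, where x minimizes and y maximizes. The game, the step size, and the function names are assumptions chosen for illustration.

```python
def gda_step(x, y, eta):
    """Simultaneous gradient descent-ascent (linear approximation only)."""
    # For f = x * y: df/dx = y, df/dy = x.
    return x - eta * y, y + eta * x

def cgd_step(x, y, eta):
    """Competitive-gradient-style step using the mixed derivative d2f/dxdy = 1."""
    denom = 1.0 + eta ** 2
    dx = -eta * (y + eta * x) / denom  # accounts for the opponent's response
    dy = eta * (x - eta * y) / denom
    return x + dx, y + dy

def distance_after(step, x=1.0, y=1.0, eta=0.2, n=100):
    """Distance from the equilibrium (0, 0) after n updates."""
    for _ in range(n):
        x, y = step(x, y, eta)
    return (x * x + y * y) ** 0.5
```

On this game `distance_after(gda_step)` grows with every update, while `distance_after(cgd_step)` shrinks toward the equilibrium, because only the second update models how each player's move changes the other's gradient.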

agent, approximation, copg

2006.10611

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Middle East > Jordan (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment > Games (1.00)
Leisure & Entertainment > Sports (0.68)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Efficient Ridesharing Dispatch Using Multi-Agent Reinforcement Learning

de Lima, Oscar, Shah, Hansal, Chu, Ting-Sheng, Fogelson, Brian

arXiv.org Machine Learning, Jun-18-2020

With the advent of ride-sharing services, there has been a huge increase in the number of people who rely on them for various needs. Most of the earlier approaches to ridesharing dispatch required handcrafted functions for estimating travel times and passenger waiting times. Traditional Reinforcement Learning (RL) based methods attempting to solve the ridesharing problem are unable to accurately model the complex environment in which taxis operate. Prior multi-agent deep RL methods based on Independent DQN (IDQN) learn decentralized value functions that are prone to instability due to the concurrent learning and exploring of multiple agents. Our proposed method based on QMIX is able to achieve centralized training with decentralized execution. We show that our model performs better than the IDQN baseline on a fixed grid size and is able to generalize well to smaller or larger grid sizes. Also, our algorithm is able to outperform the IDQN baseline in the scenario where we have a variable number of passengers and cars in each episode. Code for our paper is publicly available at: https://github.com/UMich-ML-Group/RL-Ridesharing.
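QMIX obtains centralized training with decentralized execution by combining per-agent Q-values through a mixing function that is monotonic in each input. The sketch below illustrates only that monotonicity idea and is not taken from the linked repository. In QMIX the mixing weights are produced by hypernetworks conditioned on the global state and the hidden activation is ELU; here fixed weights and ReLU are assumed for brevity, and `monotonic_mix` is a hypothetical name.

```python
def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Mix per-agent Q-values into a joint value, monotone in every input.

    w1[a][k] weights agent a into hidden unit k; taking absolute values
    keeps every weight non-negative, which guarantees monotonicity.
    """
    hidden = []
    for k in range(len(b1)):
        pre = sum(abs(w1[a][k]) * q for a, q in enumerate(agent_qs)) + b1[k]
        hidden.append(max(pre, 0.0))  # ReLU here (assumption; QMIX uses ELU)
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2
```

Because raising any single agent's Q-value can never lower the joint value, each agent can pick its own greedy action at execution time and the joint action still maximizes the mixed value.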

artificial intelligence, machine learning, reinforcement learning

2006.10897

Country: North America > United States > Michigan > Washtenaw County > Ann Arbor (0.15)

Genre: Research Report (1.00)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Provably adaptive reinforcement learning in metric spaces

Cao, Tongyi, Krishnamurthy, Akshay

arXiv.org Machine Learning, Jun-18-2020

We study reinforcement learning in continuous state and action spaces endowed with a metric. We provide a refined analysis of the algorithm of Sinclair, Banerjee, and Yu (2019) and show that its regret scales with the zooming dimension of the instance. This parameter, which originates in the bandit literature, captures the size of the subsets of near-optimal actions and is always smaller than the covering dimension used in previous analyses. As such, our results are the first provably adaptive guarantees for reinforcement learning in metric spaces.

dimension, machine learning, reinforcement learning

2006.10875

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning > Representation Of Examples (0.63)

Reparameterized Variational Divergence Minimization for Stable Imitation

Arumugam, Dilip, Dey, Debadeepta, Agarwal, Alekh, Celikyilmaz, Asli, Nouri, Elnaz, Dolan, Bill

arXiv.org Machine Learning, Jun-18-2020

While recent state-of-the-art results for adversarial imitation-learning algorithms are encouraging, recent works exploring the imitation learning from observation (ILO) setting, where trajectories only contain expert observations, have not been met with the same success. Inspired by recent investigations of f-divergence manipulation for the standard imitation learning setting (Ke et al., 2019; Ghasemipour et al., 2019), we here examine the extent to which variations in the choice of probabilistic divergence may yield more performant ILO algorithms. We unfortunately find that f-divergence minimization through reinforcement learning is susceptible to numerical instabilities. We contribute a reparameterization trick for adversarial imitation learning to alleviate the optimization challenges of the promising f-divergence minimization framework. Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
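For readers unfamiliar with the family of divergences this abstract refers to, the sketch below evaluates the standard definition D_f(P || Q) = sum over x of q(x) * f(p(x) / q(x)) on small discrete distributions. It only illustrates how the choice of the generator f selects the divergence and says nothing about the paper's reparameterization trick. The function names are assumptions, and both distributions are assumed strictly positive.

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) for discrete distributions with strictly positive entries."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def kl(t):
    """Generator for the forward KL divergence KL(P || Q)."""
    return t * math.log(t)

def reverse_kl(t):
    """Generator for the reverse KL divergence KL(Q || P)."""
    return -math.log(t)

def total_variation(t):
    """Generator for the total variation distance."""
    return 0.5 * abs(t - 1.0)
```

Calling `f_divergence(p, q, kl)` and `f_divergence(p, q, reverse_kl)` on the same pair generally gives different values, which is why the choice of divergence can change the behavior of the resulting imitation objective.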

artificial intelligence, machine learning, reinforcement learning

2006.1081

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Washington > King County > Redmond (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (0.64)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Distributed Value Function Approximation for Collaborative Multi-Agent Reinforcement Learning

Stankovic, Milos S., Beko, Marko, Stankovic, Srdjan S.

arXiv.org Machine Learning, Jun-18-2020

In this paper we propose novel distributed gradient-based temporal difference algorithms for multi-agent off-policy learning of linear approximation of the value function in Markov decision processes. The algorithms are composed of: 1) local parameter updates based on the single-agent off-policy gradient temporal difference learning algorithms, including eligibility traces with state-dependent parameters, and 2) a linear dynamic consensus scheme over the underlying, typically sparsely connected, inter-agent communication network. The proposed algorithms differ in how the time scales are selected, how local recursions are performed, and how consensus iterations are incorporated. The algorithms are completely decentralized, allowing applications in which all the agents may have completely different behavior policies while evaluating a single target policy. In this sense, the algorithms may be considered as a tool for either parallelization or multi-agent collaborative learning under given constraints. We provide weak convergence results, taking rigorously into account properties of the underlying Feller-Markov processes. We prove that, under nonrestrictive assumptions on the time-varying network topology and the individual state-visiting distributions of the agents, the parameter estimates of the algorithms weakly converge to a consensus point. The variance reduction effect of the proposed algorithms is demonstrated by analyzing a limiting stochastic differential equation. Specific guidelines for network design, providing the desired convergence points, are given. The algorithms' properties are illustrated by characteristic simulation results.
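The consensus component mentioned in this abstract can be illustrated in isolation. The sketch below omits the local gradient temporal difference updates entirely and shows only a linear consensus iteration over a sparse ring network, in which each agent replaces its parameter vector with a weighted average of its own and its two neighbors' vectors. The ring topology, the weights, and the function names are assumptions, not the authors' design.

```python
def ring_weights(n):
    """Doubly stochastic weights for a ring: 1/2 self, 1/4 each neighbor."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        w[i][i] += 0.5
        w[i][(i - 1) % n] += 0.25
        w[i][(i + 1) % n] += 0.25
    return w

def consensus_step(thetas, weights):
    """One consensus iteration: agent i averages parameters over its neighbors."""
    n, dim = len(thetas), len(thetas[0])
    return [
        [sum(weights[i][j] * thetas[j][d] for j in range(n)) for d in range(dim)]
        for i in range(n)
    ]
```

Repeating `consensus_step` drives all agents to the average of their initial parameters even though each agent communicates with only two neighbors. In the full algorithms this averaging is interleaved with each agent's local learning update.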

artificial intelligence, machine learning, reinforcement learning

2006.10443

Country:

North America > Canada > Alberta (0.14)
Europe > Serbia > Central Serbia > Belgrade (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.86)

DREAM: Deep Regret minimization with Advantage baselines and Model-free learning

Steinberger, Eric, Lerer, Adam, Brown, Noam

arXiv.org Machine Learning, Jun-18-2020

We introduce DREAM, a deep reinforcement learning algorithm that finds optimal strategies in imperfect-information games with multiple agents. Formally, DREAM converges to a Nash Equilibrium in two-player zero-sum games and to an extensive-form coarse correlated equilibrium in all other games. Our primary innovation is an effective algorithm that, in contrast to other regret-based deep learning algorithms, does not require access to a perfect simulator of the game to achieve good performance. We show that DREAM empirically achieves state-of-the-art performance among model-free algorithms in popular benchmark games, and is even competitive with algorithms that do use a perfect simulator.
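Regret-based methods of the kind this abstract refers to build on regret matching, which plays each action in proportion to its positive cumulative regret. The sketch below is not DREAM: it is tabular regret matching in self-play on rock-paper-scissors with exact expected payoffs, no neural networks, and no sampling. The payoff matrix, the asymmetric initial regrets, and the function names are assumptions chosen for illustration.

```python
# Row player's payoff; actions are rock, paper, scissors.
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def regret_matching(cum_regret):
    """Strategy proportional to positive cumulative regret, uniform if none."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(cum_regret)] * len(cum_regret)

def self_play(iterations=2000):
    """Return player 0's average strategy after regret-matching self-play."""
    regrets = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # asymmetric start (assumption)
    strategy_sum = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        s0 = regret_matching(regrets[0])
        s1 = regret_matching(regrets[1])
        # Expected payoff of each pure action against the opponent's strategy.
        u0 = [sum(PAYOFF[a][b] * s1[b] for b in range(3)) for a in range(3)]
        u1 = [-sum(PAYOFF[a][b] * s0[a] for a in range(3)) for b in range(3)]
        v0 = sum(s0[a] * u0[a] for a in range(3))
        v1 = sum(s1[b] * u1[b] for b in range(3))
        for a in range(3):
            regrets[0][a] += u0[a] - v0
            regrets[1][a] += u1[a] - v1
            strategy_sum[a] += s0[a]
    return [s / iterations for s in strategy_sum]
```

In two-player zero-sum games the average strategy under this procedure approaches a Nash equilibrium, which for rock-paper-scissors is the uniform strategy.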

artificial intelligence, machine learning, reinforcement learning

2006.1041

Country:

North America > Canada > Alberta (0.14)
North America > United States > Texas (0.04)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Time-Variant Variational Transfer for Value Functions

Canonaco, Giuseppe, Soprani, Andrea, Roveri, Manuel, Restelli, Marcello

arXiv.org Machine Learning, Jun-18-2020

In most transfer learning approaches to reinforcement learning (RL), the distribution over the tasks is assumed to be stationary. Therefore, the target and source tasks are i.i.d. samples of the same distribution. In the context of this work, we consider the problem of transferring value functions through a variational method when the distribution that generates the tasks is time-variant, proposing a solution that leverages this temporal structure inherent in the task generating process. Furthermore, by means of a finite-sample analysis, the previously mentioned solution is theoretically compared to its time-invariant version. Finally, we will provide an experimental evaluation of the proposed technique with three distinct temporal dynamics in three different RL environments.

artificial intelligence, machine learning, reinforcement learning

2005.12864

Country: Europe > Italy > Lombardy > Milan (0.04)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Cooperative Multi-Agent Reinforcement Learning with Partial Observations

Zhang, Yan, Zavlanos, Michael M.

arXiv.org Artificial Intelligence, Jun-18-2020

In this paper, we propose a distributed zeroth-order policy optimization method for Multi-Agent Reinforcement Learning (MARL). Existing MARL algorithms often assume that every agent can observe the states and actions of all the other agents in the network. This can be impractical in large-scale problems, where sharing the state and action information with multi-hop neighbors may incur significant communication overhead. The advantage of the proposed zeroth-order policy optimization method is that it allows the agents to compute the local policy gradients needed to update their local policy functions using local estimates of the global accumulated rewards that depend on partial state and action information only and can be obtained using consensus. Specifically, to calculate the local policy gradients, we develop a new distributed zeroth-order policy gradient estimator that relies on one-point residual feedback. Compared to existing zeroth-order estimators that also rely on one-point feedback, it significantly reduces the variance of the policy gradient estimates and thereby improves the learning performance. We show that the proposed distributed zeroth-order policy optimization method with constant stepsize converges to a neighborhood of the global optimal policy that depends on the number of consensus steps used to calculate the local estimates of the global accumulated rewards. Moreover, we provide numerical experiments that demonstrate that our new zeroth-order policy gradient estimator is more sample-efficient compared to other existing one-point estimators.
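The idea behind one-point residual feedback can be sketched on a simple deterministic objective. The example below is a single-agent illustration and not the distributed policy gradient estimator of the paper: it queries the objective once per iteration and uses the difference from the previous iteration's query as the gradient signal. Gaussian perturbations, the quadratic test objective, the parameter values, and the function name are all assumptions.

```python
import random

def residual_feedback_descent(f, x0, delta=0.1, step=0.002, iters=5000, seed=0):
    """Minimize f using one function query per iteration (illustrative).

    The gradient estimate is (f(x_t + delta*u_t) - f(x_{t-1} + delta*u_{t-1}))
    / delta * u_t, which reuses the previous query instead of issuing a second one.
    """
    rng = random.Random(seed)
    x = list(x0)
    u = [rng.gauss(0.0, 1.0) for _ in x]
    prev_value = f([xi + delta * ui for xi, ui in zip(x, u)])
    for _ in range(iters):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        value = f([xi + delta * ui for xi, ui in zip(x, u)])
        scale = (value - prev_value) / delta  # residual between consecutive queries
        x = [xi - step * scale * ui for xi, ui in zip(x, u)]
        prev_value = value
    return x
```

For example, `residual_feedback_descent(lambda x: sum(v * v for v in x), [2.0, -1.0])` moves the iterate toward the minimizer at the origin. Subtracting the previous query removes the large constant part of the objective value from the estimate, which is the source of the variance reduction relative to a plain one-point estimator.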

artificial intelligence, machine learning, reinforcement learning

2006.10822

Country: