AITopics

A core challenge in policy optimization in competitive Markov decision processes is the design of efficient optimization methods with desirable convergence and stability properties. To tackle this, we propose competitive policy optimization (CoPO), a novel policy gradient approach that exploits the game-theoretic nature of competitive games to derive policy updates. Motivated by the competitive gradient optimization method, we derive a bilinear approximation of the game objective. In contrast, off-the-shelf policy gradient methods utilize only linear approximations, and hence do not capture interactions among the players. We instantiate CoPO in two ways:(i) competitive policy gradient, and (ii) trust-region competitive policy optimization. We theoretically study these methods, and empirically investigate their behavior on a set of comprehensive, yet challenging, competitive games. We observe that they provide stable optimization, convergence to sophisticated strategies, and higher scores when played against baseline policy gradient methods.

agent, approximation, copg, (14 more...)

2006.10611

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Middle East > Jordan (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
(7 more...)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment > Games (1.00)
Leisure & Entertainment > Sports (0.68)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(3 more...)

de Lima, Oscar, Shah, Hansal, Chu, Ting-Sheng, Fogelson, Brian

Efficient Ridesharing Dispatch Using Multi-Agent Reinforcement Learning

With the advent of ride-sharing services, there is a huge increase in the number of people who rely on them for various needs. Most of the earlier approaches tackling this issue required handcrafted functions for estimating travel times and passenger waiting times. Traditional Reinforcement Learning (RL) based methods attempting to solve the ridesharing problem are unable to accurately model the complex environment in which taxis operate. Prior Multi-Agent Deep RL based methods based on Independent DQN (IDQN) learn decentralized value functions prone to instability due to the concurrent learning and exploring of multiple agents. Our proposed method based on QMIX is able to achieve centralized training with decentralized execution. We show that our model performs better than the IDQN baseline on a fixed grid size and is able to generalize well to smaller or larger grid sizes. Also, our algorithm is able to outperform IDQN baseline in the scenario where we have a variable number of passengers and cars in each episode. Code for our paper is publicly available at: https://github.com/UMich-ML-Group/RL-Ridesharing.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2006.10897

Country: North America > United States > Michigan > Washtenaw County > Ann Arbor (0.15)

Genre: Research Report (1.00)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Stankovic, Milos S., Beko, Marko, Stankovic, Srdjan S.

Distributed Value Function Approximation for Collaborative Multi-Agent Reinforcement Learning

In this paper we propose novel distributed gradient-based temporal difference algorithms for multi-agent off-policy learning of linear approximation of the value function in Markov decision processes. The algorithms are composed of: 1) local parameter updates based on the single-agent off-policy gradient temporal difference learning algorithms, including eligibility traces with state dependent parameters, and 2) linear dynamic consensus scheme over the underlying, typically sparsely connected, inter-agent communication network. The proposed algorithms differ in the way of how the time-scales are selected, how local recursions are performed and how consensus iterations are incorporated. The algorithms are completely decentralized, allowing applications in which all the agents may have completely different behavior policies while evaluating a single target policy. In this sense, the algorithms may be considered as a tool for either parallelization or multi-agent collaborative learning under given constraints. We provide weak convergence results, taking rigorously into account properties of the underlying Feller-Markov processes. We prove that, under nonrestrictive assumptions on the time-varying network topology and the individual state-visiting distributions of the agents, the parameter estimates of the algorithms weakly converge to a consensus point. The variance reduction effect of the proposed algorithms is demonstrated by analyzing a limiting stochastic differential equation. Specific guidelines for network design, providing the desired convergence points, are given. The algorithms' properties are illustrated by characteristic simulation results.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2006.10443

Country:

North America > Canada > Alberta (0.14)
Europe > Serbia > Central Serbia > Belgrade (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.86)

Steinberger, Eric, Lerer, Adam, Brown, Noam

DREAM: Deep Regret minimization with Advantage baselines and Model-free learning

We introduce DREAM, a deep reinforcement learning algorithm that finds optimal strategies in imperfect-information games with multiple agents. Formally, DREAM converges to a Nash Equilibrium in two-player zero-sum games and to an extensive-form coarse correlated equilibrium in all other games. Our primary innovation is an effective algorithm that, in contrast to other regret-based deep learning algorithms, does not require access to a perfect simulator of the game to achieve good performance. We show that DREAM empirically achieves state-of-the-art performance among model-free algorithms in popular benchmark games, and is even competitive with algorithms that do use a perfect simulator.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2006.1041

Country:

North America > Canada > Alberta (0.14)
North America > United States > Texas (0.04)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Zhang, Yan, Zavlanos, Michael M.

Cooperative Multi-Agent Reinforcement Learning with Partial Observations

arXiv.org Artificial IntelligenceJun-18-2020

In this paper, we propose a distributed zeroth-order policy optimization method for Multi-Agent Reinforcement Learning (MARL). Existing MARL algorithms often assume that every agent can observe the states and actions of all the other agents in the network. This can be impractical in large-scale problems, where sharing the state and action information with multi-hop neighbors may incur significant communication overhead. The advantage of the proposed zeroth-order policy optimization method is that it allows the agents to compute the local policy gradients needed to update their local policy functions using local estimates of the global accumulated rewards that depend on partial state and action information only and can be obtained using consensus. Specifically, to calculate the local policy gradients, we develop a new distributed zeroth-order policy gradient estimator that relies on one-point residual-feedback which, compared to existing zeroth-order estimators that also rely on one-point feedback, significantly reduces the variance of the policy gradient estimates improving, in this way, the learning performance. We show that the proposed distributed zeroth-order policy optimization method with constant stepsize converges to a neighborhood of the global optimal policy that depends on the number of consensus steps used to calculate the local estimates of the global accumulated rewards. Moreover, we provide numerical experiments that demonstrate that our new zeroth-order policy gradient estimator is more sample-efficient compared to other existing one-point estimators.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2006.10822

Country:

North America > United States > North Carolina > Durham County > Durham (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Alpes-Maritimes > Nice (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.50)

Verma, Arun, Hanawal, Manjesh K.

Stochastic Network Utility Maximization with Unknown Utilities: Multi-Armed Bandits Approach

arXiv.org Machine LearningJun-17-2020

In this paper, we study a novel Stochastic Network Utility Maximization (NUM) problem where the utilities of agents are unknown. The utility of each agent depends on the amount of resource it receives from a network operator/controller. The operator desires to do a resource allocation that maximizes the expected total utility of the network. We consider threshold type utility functions where each agent gets non-zero utility if the amount of resource it receives is higher than a certain threshold. Otherwise, its utility is zero (hard real-time). We pose this NUM setup with unknown utilities as a regret minimization problem. Our goal is to identify a policy that performs as `good' as an oracle policy that knows the utilities of agents. We model this problem setting as a bandit setting where feedback obtained in each round depends on the resource allocated to the agents. We propose algorithms for this novel setting using ideas from Multiple-Play Multi-Armed Bandits and Combinatorial Semi-Bandits. We show that the proposed algorithm is optimal when all agents have the same utility. We validate the performance guarantees of our proposed algorithms through numerical experiments.

allocation equivalent, artificial intelligence, data mining, (16 more...)

2006.09997

Country: Asia > India (0.14)

Genre: Research Report (1.00)

Industry: Telecommunications (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Liebman, Elad, Stone, Peter

Artificial Musical Intelligence: A Survey

arXiv.org Artificial IntelligenceJun-17-2020

Computers have been used to analyze and create music since they were first introduced in the 1950s and 1960s. Beginning in the late 1990s, the rise of the Internet and large scale platforms for music recommendation and retrieval have made music an increasingly prevalent domain of machine learning and artificial intelligence research. While still nascent, several different approaches have been employed to tackle what may broadly be referred to as "musical intelligence." This article provides a definition of musical intelligence, introduces a taxonomy of its constituent components, and surveys the wide range of AI methods that can be, and have been, brought to bear in its pursuit, with a particular emphasis on machine learning methods.

evolutionary algorithm, machine learning, reinforcement learning, (20 more...)

2006.10553

Country:

North America > United States > Texas > Travis County > Austin (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > Middle East > Jordan (0.04)
(8 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)
Education (0.93)
Health & Medicine (0.92)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(8 more...)

arXiv.org Artificial IntelligenceJun-17-2020

Logic, Probability and Action: A Situation Calculus Perspective

Belle, Vaishak

The unification of logic and probability is a long-standing concern in AI, and more generally, in the philosophy of science. In essence, logic provides an easy way to specify properties that must hold in every possible world, and probability allows us to further quantify the weight and ratio of the worlds that must satisfy a property. To that end, numerous developments have been undertaken, culminating in proposals such as probabilistic relational models. While this progress has been notable, a general-purpose first-order knowledge representation language to reason about probabilities and dynamics, including in continuous settings, is still to emerge. In this paper, we survey recent results pertaining to the integration of logic, probability and actions in the situation calculus, which is arguably one of the oldest and most well-known formalisms. We then explore reduction theorems and programming interfaces for the language. These results are motivated in the context of cognitive robotics (as envisioned by Reiter and his colleagues) for the sake of concreteness. Overall, the advantage of proving results for such a general language is that it becomes possible to adapt them to any special-purpose fragment, including but not limited to popular probabilistic relational models.

artificial intelligence, bayesian inference, logic & formal reasoning, (14 more...)

2006.09868

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
Africa > South Sudan > Equatoria > Central Equatoria > Juba (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(2 more...)

Jeong, Heejin, Hassani, Hamed, Morari, Manfred, Lee, Daniel D., Pappas, George J.

Learning to Track Dynamic Targets in Partially Known Environments

arXiv.org Machine LearningJun-17-2020

We solve active target tracking, one of the essential tasks in autonomous systems, using a deep reinforcement learning (RL) approach. In this problem, an autonomous agent is tasked with acquiring information about targets of interests using its onboard sensors. The classical challenges in this problem are system model dependence and the difficulty of computing information-theoretic cost functions for a long planning horizon. RL provides solutions for these challenges as the length of its effective planning horizon does not affect the computational complexity, and it drops the strong dependency of an algorithm on system models. In particular, we introduce Active Tracking Target Network (ATTN), a unified RL policy that is capable of solving major sub-tasks of active target tracking -- in-sight tracking, navigation, and exploration. The policy shows robust behavior for tracking agile and anomalous targets with a partially known target model. Additionally, the same policy is able to navigate in obstacle environments to reach distant targets as well as explore the environment when targets are positioned in unexpected locations.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2006.1019

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
North America > United States > California > Alameda County > Berkeley (0.14)
Europe > Switzerland > Zürich > Zürich (0.04)
(10 more...)

Genre:

Research Report (0.64)
Personal (0.46)

Industry:

Education (0.68)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(2 more...)

Chaplot, Devendra Singh, Jiang, Helen, Gupta, Saurabh, Gupta, Abhinav

Semantic Curiosity for Active Visual Learning

arXiv.org Artificial IntelligenceJun-16-2020

In this paper, we study the task of embodied interactive learning for object detection. Given a set of environments (and some labeling budget), our goal is to learn an object detector by having an agent select what data to obtain labels for. How should an exploration policy decide which trajectory should be labeled? One possibility is to use a trained object detector's failure cases as an external reward. However, this will require labeling millions of frames required for training RL policies, which is infeasible. Instead, we explore a self-supervised approach for training our exploration policy by introducing a notion of semantic curiosity. Our semantic curiosity policy is based on a simple observation -- the detection outputs should be consistent. Therefore, our semantic curiosity rewards trajectories with inconsistent labeling behavior and encourages the exploration policy to explore such areas. The exploration policy trained via semantic curiosity generalizes to novel scenes and helps train an object detector that outperforms baselines trained with other possible alternatives such as random exploration, prediction-error curiosity, and coverage-maximizing exploration.

artificial intelligence, exploration policy, machine learning, (11 more...)

2006.09367

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Illinois (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Education > Educational Setting > Online (0.48)
Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)