Goto

Collaborating Authors

 Agents


Expressing Arbitrary Reward Functions as Potential-Based Advice

AAAI Conferences

Effectively incorporating external advice is an important problem in reinforcement learning, especially as it moves into the real world. Potential-based reward shaping is a way to provide the agent with a specific form of additional reward, with the guarantee of policy invariance. In this work we give a novel way to incorporate an arbitrary reward function with the same guarantee, by implicitly translating it into the specific form of dynamic advice potentials, which are maintained as an auxiliary value function learnt at the same time. We show that advice provided in this way captures the input reward function in expectation, and demonstrate its efficacy empirically.


Concurrent PAC RL

AAAI Conferences

In many real-world situations a decision maker may make decisions across many separate reinforcement learning tasks in parallel, yet there has been very little work on concurrent RL. Building on the efficient exploration RL literature, we introduce two new concurrent RL algorithms and bound their sample complexity. We show that under some mild conditions, both when the agent is known to be acting in many copies of the same MDP, and when they are not the same but are taken from a finite set, we can gain linear improvements in the sample complexity over not sharing information. This is quite exciting as a linear speedup is the most one might hope to gain. Our preliminary experiments confirm this result and show empirical benefits.


Policy Tree: Adaptive Representation for Policy Gradient

AAAI Conferences

Much of the focus on finding good representations in reinforcement learning has been on learning complex non-linear predictors of value. Policy gradient algorithms, which directly represent the policy, often need fewer parameters to learn good policies. However, they typically employ a fixed parametric representation that may not be sufficient for complex domains. This paper introduces the Policy Tree algorithm, which can learn an adaptive representation of policy in the form of a decision tree over different instantiations of a base policy. Policy gradient is used both to optimize the parameters and to grow the tree by choosing splits that enable the maximum local increase in the expected return of the policy. Experiments show that this algorithm can choose genuinely helpful splits and significantly improve upon the commonly used linear Gibbs softmax policy, which we choose as our base policy.


Unsupervised Cross-Domain Transfer in Policy Gradient Reinforcement Learning via Manifold Alignment

AAAI Conferences

The success of applying policy gradient reinforcement learning (RL) to difficult control tasks hinges crucially on the ability to determine a sensible initialization for the policy. Transfer learning methods tackle this problem by reusing knowledge gleaned from solving other related tasks. In the case of multiple task domains, these algorithms require an inter-task mapping to facilitate knowledge transfer across domains. However, there are currently no general methods to learn an inter-task mapping without requiring either background knowledge that is not typically present in RL settings, or an expensive analysis of an exponential number of inter-task mappings in the size of the state and action spaces. This paper introduces an autonomous framework that uses unsupervised manifold alignment to learn inter-task mappings and effectively transfer samples between different task domains. Empirical results on diverse dynamical systems, including an application to quadrotor control, demonstrate its effectiveness for cross-domain transfer in the context of policy gradient RL.


Finding a Collective Set of Items: From Proportional Multirepresentation to Group Recommendation

AAAI Conferences

We consider the following problem: There is a set of items (e.g., movies) and a group of agents (e.g., passengers on a plane); each agent has some intrinsic utility for each of the items. Our goal is to pick a set of K items that maximize the total derived utility of all the agents (i.e., in our example we are to pick K movies that we put on the plane's entertainment system). However, the actual utility that an agent derives from a given item is only a fraction of its intrinsic one, and this fraction depends on how the agent ranks the item among the chosen, available, ones. We provide a formal specification of the model and provide concrete examples and settings where it is applicable. We show that the problem is hard in general, but we show a number of tractability results for its natural special cases.


Distributing Coalition Value Calculations to Coalition Members

AAAI Conferences

Within characteristic function games, agents have the option of joining one of many different coalitions, based on the utility value of each candidate coalition. However, determining this utility value can be computationally complex since the number of coalitions increases exponentially with the number of agents available. Various approaches have been proposed that mediate this problem by distributing the computational load so that each agent calculates only a subset of coalition values. However, current approaches are either highly inefficient due to redundant calculations, or make the benevolence assumption (i.e. are not suitable for adversarial environments). We introduce DCG, a novel algorithm that distributes the calculations of coalition utility values across a community of agents, such that: (i) no inter-agent communication is required; (ii) the coalition value calculations are (approximately) equally partitioned into shares, one for each agent; (iii) the utility value is calculated only once for each coalition, thus redundant calculations are eliminated; (iv) there is an equal number of operations for agents with equal sized shares; and (v) an agent is only allocated those coalitions in which it is a potential member. The DCG algorithm is presented and illustrated by means of an example. We formally prove that our approach allocates all of the coalitions to the agents, and that each coalition is assigned once and only once.


SCRAM: Scalable Collision-avoiding Role Assignment with Minimal-Makespan for Formational Positioning

AAAI Conferences

Teams of mobile robots often need to divide up subtasks efficiently. In spatial domains, a key criterion for doing so may depend on distances between robots and the subtasks' locations. This paper considers a specific such criterion, namely how to assign interchangeable robots, represented as point masses, to a set of target goal locations within an open two dimensional space such that the makespan (time for all robots to reach their target locations) is minimized while also preventing collisions among robots. We present scaleable (computable in polynomial time) role assignment algorithms that we classify as being SCRAM (Scalable Collision-avoiding Role Assignment with Minimal-makespan). SCRAM role assignment algorithms use a graph theoretic approach to map agents to target goal locations such that our objectives for both minimizing the makespan and avoiding agent collisions are met. A system using SCRAM role assignment was originally designed to allow for decentralized coordination among physically realistic simulated humanoid soccer playing robots in the partially observable, non-deterministic, noisy, dynamic, and limited communication setting of the RoboCup 3D simulation league. In its current form, SCRAM role assignment generalizes well to many realistic and real-world multiagent systems, and scales to thousands of agents.


Generalization Analysis for Game-Theoretic Machine Learning

AAAI Conferences

For Internet applications like sponsored search, cautions need to be taken when using machine learning to optimize their mechanisms (e.g., auction) since self-interested agents in these applications may change their behaviors (and thus the data distribution) in response to the mechanisms. To tackle this problem, a framework called game-theoretic machine learning (GTML) was recently proposed, which first learns a Markov behavior model to characterize agents' behaviors, and then learns the optimal mechanism by simulating agents' behavior changes in response to the mechanism. While GTML has demonstrated practical success, its generalization analysis is challenging because the behavior data are non-i.i.d. and dependent on the mechanism. To address this challenge, first, we decompose the generalization error for GTML into the behavior learning error and the mechanism learning error; second, for the behavior learning error, we obtain novel non-asymptotic error bounds for both parametric and non-parametric behavior learning methods; third, for the mechanism learning error, we derive a uniform convergence bound based on a new concept called \emph{nested covering number} of the mechanism space and the generalization analysis techniques developed for mixing sequences.


A Counter Abstraction Technique for the Verification of Robot Swarms

AAAI Conferences

We study parameterised verification of robot swarms against temporal-epistemic specifications. We relax some of the significant restrictions assumed in the literature and present a counter abstraction approach that enable us to verify a potentially much smaller abstract model when checking a formula on a swarm of any size. We present an implementation and discuss experimental results obtained for the alpha algorithm for robot swarms.


Distributed Multiplicative Weights Methods for DCOP

AAAI Conferences

In this game, each player deal with enormous sizes such as a smart grid is rapidly increasing associated with a variable keeps providing probability distributions in AI communities. The distributed constraint optimization over its domain, and tries to minimize the regret, problem (DCOP for short) is arguably the most which is the average additional cost incurred by the probability studied problem in this setting, where the goal is to find an distributions against the strategy of outputting a best assignment that minimizes the total sum of costs incurred single value all the time. We can make the regret of each by (local) cost functions. Since it takes a prohibitively long agent arbitrarily small by utilizing the multiplicative weights time to exactly solve DCOP, we need to resort to incomplete method. Finally, we round the obtained probability distributions algorithms, and a plethora of incomplete algorithms to integer values. We prove that our method converges have been proposed in the literature, such as local search to a certain kind of equilibrium, called a coarse correlated based algorithms (Maheswaran, Pearce, and Tambe 2004; equilibrium. Zhang et al. 2005), inference based algorithms (Farinelli We empirically compare our methods with previous stateof-the-art et al. 2008), graph based algorithms (Bowring et al. 2008; methods. We demonstrate that our methods are Kiekintveld et al. 2010), divide-and-coordinate based algorithms scalable, and that DMW-Game outperforms other methods (Vinyals, Rodriguez-Aguilar, and Cerquides 2010; in terms of solution quality and efficiency. Hatano and Hirayama 2013), and sampling based algorithms (Ottens, Dimitrakakis, and Faltings 2012; Nguyen, Yeoh, and Lau 2013).