Goto

Collaborating Authors

 Undirected Networks


On the Importance of Environments in Human-Robot Coordination

arXiv.org Artificial Intelligence

When studying robots collaborating with humans, much of the focus has been on robot policies that coordinate fluently with human teammates in collaborative tasks. However, less emphasis has been placed on the effect of the environment on coordination behaviors. To thoroughly explore environments that result in diverse behaviors, we propose a framework for procedural generation of environments that are (1) stylistically similar to human-authored environments, (2) guaranteed to be solvable by the human-robot team, and (3) diverse with respect to coordination measures. We analyze the procedurally generated environments in the Overcooked benchmark domain via simulation and an online user study. Results show that the environments result in qualitatively different emerging behaviors and statistically significant differences in collaborative fluency metrics, even when the robot runs the same planning algorithm.


A Max-Min Entropy Framework for Reinforcement Learning

arXiv.org Artificial Intelligence

In this paper, we propose a max-min entropy framework for reinforcement learning (RL) to overcome the limitation of the maximum entropy RL framework in model-free sample-based learning. Whereas the maximum entropy RL framework guides learning for policies to reach states with high entropy in the future, the proposed max-min entropy framework aims to learn to visit states with low entropy and maximize the entropy of these low-entropy states to promote exploration. For general Markov decision processes (MDPs), an efficient algorithm is constructed under the proposed max-min entropy framework based on disentanglement of exploration and exploitation. Numerical results show that the proposed algorithm yields drastic performance improvement over the current state-of-the-art RL algorithms.


Accelerated Policy Evaluation: Learning Adversarial Environments with Adaptive Importance Sampling

arXiv.org Artificial Intelligence

The evaluation of rare but high-stakes events remains one of the main difficulties in obtaining reliable policies from intelligent agents, especially in large or continuous state/action spaces where limited scalability enforces the use of a prohibitively large number of testing iterations. On the other hand, a biased or inaccurate policy evaluation in a safety-critical system could potentially cause unexpected catastrophic failures during deployment. In this paper, we propose the Accelerated Policy Evaluation (APE) method, which simultaneously uncovers rare events and estimates the rare event probability in Markov decision processes. The APE method treats the environment nature as an adversarial agent and learns towards, through adaptive importance sampling, the zero-variance sampling distribution for the policy evaluation. Moreover, APE is scalable to large discrete or continuous spaces by incorporating function approximators. We investigate the convergence properties of proposed algorithms under suitable regularity conditions. Our empirical studies show that APE estimates rare event probability with a smaller variance while only using orders of magnitude fewer samples compared to baseline methods in both multi-agent and single-agent environments.


Learning the Preferences of Uncertain Humans with Inverse Decision Theory

arXiv.org Artificial Intelligence

Existing observational approaches for learning human preferences, such as inverse reinforcement learning, usually make strong assumptions about the observability of the human's environment. However, in reality, people make many important decisions under uncertainty. To better understand preference learning in these cases, we study the setting of inverse decision theory (IDT), a previously proposed framework where a human is observed making non-sequential binary decisions under uncertainty. In IDT, the human's preferences are conveyed through their loss function, which expresses a tradeoff between different types of mistakes. We give the first statistical analysis of IDT, providing conditions necessary to identify these preferences and characterizing the sample complexity -- the number of decisions that must be observed to learn the tradeoff the human is making to a desired precision. Interestingly, we show that it is actually easier to identify preferences when the decision problem is more uncertain. Furthermore, uncertain decision problems allow us to relax the unrealistic assumption that the human is an optimal decision maker but still identify their exact preferences; we give sample complexities in this suboptimal case as well. Our analysis contradicts the intuition that partial observability should make preference learning more difficult. It also provides a first step towards understanding and improving preference learning methods for uncertain and suboptimal humans.


On the Sample Complexity of Batch Reinforcement Learning with Policy-Induced Data

arXiv.org Artificial Intelligence

Batch reinforcement learning (RL) broadly refers to the problem of finding a policy with high expected return in a stochastic control problem when only a batch of data collected from the controlled system is available. Here, we consider this problem for finite state-action ("tabular") Markov decision processes (MDPs) when the data takes the form of trajectories obtained by following some logging policy. In more details, the trajectories are composed of sequences of states, actions, and rewards, where the action is chosen by the logging policy and the next state and rewards follow the distributions specified by the MDP's transition parameters. Arguably, this is the most natural problem setting to consider in batch learning. The basic questions are (a) what logging policy should one choose to generate the data so as to maximize the chance of obtaining a good policy with as little data as possible; and (b) how many transitions are sufficient and necessary to obtain a good policy and which algorithm to use to obtain such a policy? Note that here the logging policy needs to be chosen a priori, that is, before the data collection process begins. An alternative way of saying this is that the data collection is done in a passive way. Our main results are as follows: First, we show that (perhaps unsurprisingly), in the lack of extra information about the nature of the MDP, the best logging policy should choose actions uniformly at random. Next, we show that the number of transitions necessary and sufficient to obtain a good policy, the sample complexity of learning, is an exponential function of the minimum of the number of states and the planning horizon.


Many Agent Reinforcement Learning Under Partial Observability

arXiv.org Artificial Intelligence

Recent renewed interest in multi-agent reinforcement learning (MARL) has generated an impressive array of techniques that leverage deep reinforcement learning, primarily actor-critic architectures, and can be applied to a limited range of settings in terms of observability and communication. However, a continuing limitation of much of this work is the curse of dimensionality when it comes to representations based on joint actions, which grow exponentially with the number of agents. In this paper, we squarely focus on this challenge of scalability. We apply the key insight of action anonymity, which leads to permutation invariance of joint actions, to two recently presented deep MARL algorithms, MADDPG and IA2C, and compare these instantiations to another recent technique that leverages action anonymity, viz., mean-field MARL. We show that our instantiations can learn the optimal behavior in a broader class of agent networks than the mean-field method, using a recently introduced pragmatic domain.


Disentangling Identifiable Features from Noisy Data with Structured Nonlinear ICA

arXiv.org Machine Learning

We introduce a new general identifiable framework for principled disentanglement referred to as Structured Nonlinear Independent Component Analysis (SNICA). Our contribution is to extend the identifiability theory of deep generative models for a very broad class of structured models. While previous works have shown identifiability for specific classes of time-series models, our theorems extend this to more general temporal structures as well as to models with more complex structures such as spatial dependencies. In particular, we establish the major result that identifiability for this framework holds even in the presence of noise of unknown distribution. The SNICA setting therefore subsumes all the existing nonlinear ICA models for time-series and also allows for new much richer identifiable models. Finally, as an example of our framework's flexibility, we introduce the first nonlinear ICA model for time-series that combines the following very useful properties: it accounts for both nonstationarity and autocorrelation in a fully unsupervised setting; performs dimensionality reduction; models hidden states; and enables principled estimation and inference by variational maximum-likelihood.


Cooperative Multi-Agent Reinforcement Learning Based Distributed Dynamic Spectrum Access in Cognitive Radio Networks

arXiv.org Artificial Intelligence

This work has been submitted to the IEEE for possible publication. Abstract With the development of the 5G and Internet of Things, amounts of wireless devices need to share the limited spectrum resources. Dynamic spectrum access (DSA) is a promising paradigm to remedy the problem of inef!cient spectrum utilization brought upon by the historical command-and-control approach to spectrum allocation. In this paper, we investigate the distributed DSA problem for multiuser in a typical multi-channel cognitive radio network. The problem is formulated as a decentralized partially observable Markov decision process (Dec-POMDP), and we proposed a centralized off-line training and distributed on-line execution framework based on cooperative multi-agent reinforcement learning (MARL). We employ the deep recurrent Q-network (DRQN) to address the partial observability of the state for each cognitive user. The ultimate goal is to learn a cooperative strategy which maximizes the sum throughput of cognitive radio network in distributed fashion without coordination information exchange between cognitive users. This work was supported in part by the National Natural Science Foundation of China under Grant 6193000305. X. Tan, L. Zhou, Y. Sun, H. Wang, H. Zhao and J. Wei are all with College of Electronic Science and Technology, National University of Defense Technology, Changsha, 410073, China (E-mail: {tanxiang, zhouli2035, haijunwang14, sunyuli19, haitaozhao, wjbhw}@nudt.edu.cn). Boon-Chong Seet is with the Department of Electrical and Electronic Engineering, Auckland University of Technology, Auckland 1142, New Zealand (E-mail: boon-chong.seet@aut.ac.nz). Victor C. M. Leung is with Shenzhen University, Shenzhen, China and the University of British Columbia, Vancouver, Canada (E-mail: vleung@ieee.org). 2 From the simulation results, we can observe that the proposed algorithm can converge fast and achieve almost the optimal performance. The future network is involving into the Internet of Everything.


Seeing Differently, Acting Similarly: Imitation Learning with Heterogeneous Observations

arXiv.org Artificial Intelligence

In many real-world imitation learning tasks, the demonstrator and the learner have to act in different but full observation spaces. This situation generates significant obstacles for existing imitation learning approaches to work, even when they are combined with traditional space adaptation techniques. The main challenge lies in bridging expert's occupancy measures to learner's dynamically changing occupancy measures under the different observation spaces. In this work, we model the above learning problem as Heterogeneous Observations Imitation Learning (HOIL). We propose the Importance Weighting with REjection (IWRE) algorithm based on the techniques of importance-weighting, learning with rejection, and active querying to solve the key challenge of occupancy measure matching. Experimental results show that IWRE can successfully solve HOIL tasks, including the challenging task of transforming the vision-based demonstrations to random access memory (RAM)-based policies under the Atari domain.


Minimax Estimation of Partially-Observed Vector AutoRegressions

arXiv.org Machine Learning

To understand the behavior of large dynamical systems like transportation networks, one must often rely on measurements transmitted by a set of sensors, for instance individual vehicles. Such measurements are likely to be incomplete and imprecise, which makes it hard to recover the underlying signal of interest.Hoping to quantify this phenomenon, we study the properties of a partially-observed state-space model. In our setting, the latent state $X$ follows a high-dimensional Vector AutoRegressive process $X_t = \theta X_{t-1} + \varepsilon_t$. Meanwhile, the observations $Y$ are given by a noise-corrupted random sample from the state $Y_t = \Pi_t X_t + \eta_t$. Several random sampling mechanisms are studied, allowing us to investigate the effect of spatial and temporal correlations in the distribution of the sampling matrices $\Pi_t$.We first prove a lower bound on the minimax estimation error for the transition matrix $\theta$. We then describe a sparse estimator based on the Dantzig selector and upper bound its non-asymptotic error, showing that it achieves the optimal convergence rate for most of our sampling mechanisms. Numerical experiments on simulated time series validate our theoretical findings, while an application to open railway data highlights the relevance of this model for public transport traffic analysis.