Goto

Collaborating Authors

 Markov Models


Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations

arXiv.org Artificial Intelligence

Reinforcement Learning (RL) has achieved several remarkable empirical successes in the last decade, which include playing Atari 2600 video games at superhuman levels (Mnih et al., 2015), AlphaGo or AlphaGo Zero surpassing champions in Go (Silver et al., 2018), AlphaStar's victory over top-ranked professional players in StarCraft (Vinyals et al., 2019), or practical self-driving cars. These applications all correspond to the setting of rich observations, where the state space is very large and where observations may be images, text or audio data. In contrast, most provably efficient RL algorithms are still limited to the classical tabular setting where the state space is small (Kearns and Singh, 2002; Brafman and Tennenholtz, 2002; Azar et al., 2017; Dann et al., 2019) and do not scale to the rich observation setting. To derive guarantees for large state spaces, much of the existing work in RL theory relies on a realizability and a low-rank assumption (Krishnamurthy et al., 2016; Jiang et al., 2017; Dann et al., 2018; Du et al., 2019a; Misra et al., 2020; Agarwal et al., 2020b). Different notions of rank have been adopted in the literature, including that of a low-rank transition matrix (Jin et al., 2020a), a low Bellman rank (Jiang et al., 2017), Wittness rank (Sun et al., 2019), Eluder dimension (Osband and Van Roy, 2014), Bellman-Eluder dimension (Jin et al., 2021), or bilinear classes (Du et al., 2021).


Vehicle Trajectory Prediction in City-scale Road Networks using a Direction-based Sequence-to-Sequence Model with Spatiotemporal Attention Mechanisms

arXiv.org Artificial Intelligence

Trajectory prediction of vehicles at the city scale is of great importance to various location-based applications such as vehicle navigation, traffic management, and location-based recommendations. Existing methods typically represent a trajectory as a sequence of grid cells, road segments or intention sets. None of them is ideal, as the cell-based representation ignores the road network structures and the other two are less efficient in analyzing city-scale road networks. In addition, most models focus on predicting the immediate next position, and are difficult to generalize for longer sequences. To address these problems, we propose a novel sequence-to-sequence model named D-LSTM (Direction-based Long Short-Term Memory), which represents each trajectory as a sequence of intersections and associated movement directions, and then feeds them into a LSTM encoder-decoder network for future trajectory generation. Furthermore, we introduce a spatial attention mechanism to capture dynamic spatial dependencies in road networks, and a temporal attention mechanism with a sliding context window to capture both short- and long-term temporal dependencies in trajectory data. Extensive experiments based on two real-world large-scale taxi trajectory datasets show that D-LSTM outperforms the existing state-of-the-art methods for vehicle trajectory prediction, validating the effectiveness of the proposed trajectory representation method and spatiotemporal attention mechanisms.


On the Importance of Environments in Human-Robot Coordination

arXiv.org Artificial Intelligence

When studying robots collaborating with humans, much of the focus has been on robot policies that coordinate fluently with human teammates in collaborative tasks. However, less emphasis has been placed on the effect of the environment on coordination behaviors. To thoroughly explore environments that result in diverse behaviors, we propose a framework for procedural generation of environments that are (1) stylistically similar to human-authored environments, (2) guaranteed to be solvable by the human-robot team, and (3) diverse with respect to coordination measures. We analyze the procedurally generated environments in the Overcooked benchmark domain via simulation and an online user study. Results show that the environments result in qualitatively different emerging behaviors and statistically significant differences in collaborative fluency metrics, even when the robot runs the same planning algorithm.


A Max-Min Entropy Framework for Reinforcement Learning

arXiv.org Artificial Intelligence

In this paper, we propose a max-min entropy framework for reinforcement learning (RL) to overcome the limitation of the maximum entropy RL framework in model-free sample-based learning. Whereas the maximum entropy RL framework guides learning for policies to reach states with high entropy in the future, the proposed max-min entropy framework aims to learn to visit states with low entropy and maximize the entropy of these low-entropy states to promote exploration. For general Markov decision processes (MDPs), an efficient algorithm is constructed under the proposed max-min entropy framework based on disentanglement of exploration and exploitation. Numerical results show that the proposed algorithm yields drastic performance improvement over the current state-of-the-art RL algorithms.


Accelerated Policy Evaluation: Learning Adversarial Environments with Adaptive Importance Sampling

arXiv.org Artificial Intelligence

The evaluation of rare but high-stakes events remains one of the main difficulties in obtaining reliable policies from intelligent agents, especially in large or continuous state/action spaces where limited scalability enforces the use of a prohibitively large number of testing iterations. On the other hand, a biased or inaccurate policy evaluation in a safety-critical system could potentially cause unexpected catastrophic failures during deployment. In this paper, we propose the Accelerated Policy Evaluation (APE) method, which simultaneously uncovers rare events and estimates the rare event probability in Markov decision processes. The APE method treats the environment nature as an adversarial agent and learns towards, through adaptive importance sampling, the zero-variance sampling distribution for the policy evaluation. Moreover, APE is scalable to large discrete or continuous spaces by incorporating function approximators. We investigate the convergence properties of proposed algorithms under suitable regularity conditions. Our empirical studies show that APE estimates rare event probability with a smaller variance while only using orders of magnitude fewer samples compared to baseline methods in both multi-agent and single-agent environments.


Learning the Preferences of Uncertain Humans with Inverse Decision Theory

arXiv.org Artificial Intelligence

Existing observational approaches for learning human preferences, such as inverse reinforcement learning, usually make strong assumptions about the observability of the human's environment. However, in reality, people make many important decisions under uncertainty. To better understand preference learning in these cases, we study the setting of inverse decision theory (IDT), a previously proposed framework where a human is observed making non-sequential binary decisions under uncertainty. In IDT, the human's preferences are conveyed through their loss function, which expresses a tradeoff between different types of mistakes. We give the first statistical analysis of IDT, providing conditions necessary to identify these preferences and characterizing the sample complexity -- the number of decisions that must be observed to learn the tradeoff the human is making to a desired precision. Interestingly, we show that it is actually easier to identify preferences when the decision problem is more uncertain. Furthermore, uncertain decision problems allow us to relax the unrealistic assumption that the human is an optimal decision maker but still identify their exact preferences; we give sample complexities in this suboptimal case as well. Our analysis contradicts the intuition that partial observability should make preference learning more difficult. It also provides a first step towards understanding and improving preference learning methods for uncertain and suboptimal humans.


On the Sample Complexity of Batch Reinforcement Learning with Policy-Induced Data

arXiv.org Artificial Intelligence

Batch reinforcement learning (RL) broadly refers to the problem of finding a policy with high expected return in a stochastic control problem when only a batch of data collected from the controlled system is available. Here, we consider this problem for finite state-action ("tabular") Markov decision processes (MDPs) when the data takes the form of trajectories obtained by following some logging policy. In more details, the trajectories are composed of sequences of states, actions, and rewards, where the action is chosen by the logging policy and the next state and rewards follow the distributions specified by the MDP's transition parameters. Arguably, this is the most natural problem setting to consider in batch learning. The basic questions are (a) what logging policy should one choose to generate the data so as to maximize the chance of obtaining a good policy with as little data as possible; and (b) how many transitions are sufficient and necessary to obtain a good policy and which algorithm to use to obtain such a policy? Note that here the logging policy needs to be chosen a priori, that is, before the data collection process begins. An alternative way of saying this is that the data collection is done in a passive way. Our main results are as follows: First, we show that (perhaps unsurprisingly), in the lack of extra information about the nature of the MDP, the best logging policy should choose actions uniformly at random. Next, we show that the number of transitions necessary and sufficient to obtain a good policy, the sample complexity of learning, is an exponential function of the minimum of the number of states and the planning horizon.


Many Agent Reinforcement Learning Under Partial Observability

arXiv.org Artificial Intelligence

Recent renewed interest in multi-agent reinforcement learning (MARL) has generated an impressive array of techniques that leverage deep reinforcement learning, primarily actor-critic architectures, and can be applied to a limited range of settings in terms of observability and communication. However, a continuing limitation of much of this work is the curse of dimensionality when it comes to representations based on joint actions, which grow exponentially with the number of agents. In this paper, we squarely focus on this challenge of scalability. We apply the key insight of action anonymity, which leads to permutation invariance of joint actions, to two recently presented deep MARL algorithms, MADDPG and IA2C, and compare these instantiations to another recent technique that leverages action anonymity, viz., mean-field MARL. We show that our instantiations can learn the optimal behavior in a broader class of agent networks than the mean-field method, using a recently introduced pragmatic domain.


Disentangling Identifiable Features from Noisy Data with Structured Nonlinear ICA

arXiv.org Machine Learning

We introduce a new general identifiable framework for principled disentanglement referred to as Structured Nonlinear Independent Component Analysis (SNICA). Our contribution is to extend the identifiability theory of deep generative models for a very broad class of structured models. While previous works have shown identifiability for specific classes of time-series models, our theorems extend this to more general temporal structures as well as to models with more complex structures such as spatial dependencies. In particular, we establish the major result that identifiability for this framework holds even in the presence of noise of unknown distribution. The SNICA setting therefore subsumes all the existing nonlinear ICA models for time-series and also allows for new much richer identifiable models. Finally, as an example of our framework's flexibility, we introduce the first nonlinear ICA model for time-series that combines the following very useful properties: it accounts for both nonstationarity and autocorrelation in a fully unsupervised setting; performs dimensionality reduction; models hidden states; and enables principled estimation and inference by variational maximum-likelihood.


Cooperative Multi-Agent Reinforcement Learning Based Distributed Dynamic Spectrum Access in Cognitive Radio Networks

arXiv.org Artificial Intelligence

This work has been submitted to the IEEE for possible publication. Abstract With the development of the 5G and Internet of Things, amounts of wireless devices need to share the limited spectrum resources. Dynamic spectrum access (DSA) is a promising paradigm to remedy the problem of inef!cient spectrum utilization brought upon by the historical command-and-control approach to spectrum allocation. In this paper, we investigate the distributed DSA problem for multiuser in a typical multi-channel cognitive radio network. The problem is formulated as a decentralized partially observable Markov decision process (Dec-POMDP), and we proposed a centralized off-line training and distributed on-line execution framework based on cooperative multi-agent reinforcement learning (MARL). We employ the deep recurrent Q-network (DRQN) to address the partial observability of the state for each cognitive user. The ultimate goal is to learn a cooperative strategy which maximizes the sum throughput of cognitive radio network in distributed fashion without coordination information exchange between cognitive users. This work was supported in part by the National Natural Science Foundation of China under Grant 6193000305. X. Tan, L. Zhou, Y. Sun, H. Wang, H. Zhao and J. Wei are all with College of Electronic Science and Technology, National University of Defense Technology, Changsha, 410073, China (E-mail: {tanxiang, zhouli2035, haijunwang14, sunyuli19, haitaozhao, wjbhw}@nudt.edu.cn). Boon-Chong Seet is with the Department of Electrical and Electronic Engineering, Auckland University of Technology, Auckland 1142, New Zealand (E-mail: boon-chong.seet@aut.ac.nz). Victor C. M. Leung is with Shenzhen University, Shenzhen, China and the University of British Columbia, Vancouver, Canada (E-mail: vleung@ieee.org). 2 From the simulation results, we can observe that the proposed algorithm can converge fast and achieve almost the optimal performance. The future network is involving into the Internet of Everything.