Goto

Collaborating Authors

 Undirected Networks


Near-Optimal Reinforcement Learning with Self-Play

arXiv.org Artificial Intelligence

This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against itself without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\tilde{\mathcal{O}}(S^2AB)$ steps of game playing, when only highlighting the dependency on $(S,A,B)$. In contrast, the best existing lower bound scales as $\Omega(S(A+B))$ and has a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the \emph{Nash Q-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new \emph{Nash V-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games---a learning objective different from finding the Nash equilibrium.


A Provably Efficient Sample Collection Strategy for Reinforcement Learning

arXiv.org Machine Learning

A common assumption in reinforcement learning (RL) is to have access to a generative model (i.e., a simulator of the environment), which allows to generate samples from any desired state-action pair. Nonetheless, in many settings a generative model may not be available and an adaptive exploration strategy is needed to efficiently collect samples from an unknown environment by direct interaction. In this paper, we study the scenario where an algorithm based on the generative model assumption defines the (possibly time-varying) amount of samples $b(s,a)$ required at each state-action pair $(s,a)$ and an exploration strategy has to learn how to generate $b(s,a)$ samples as fast as possible. Building on recent results for regret minimization in the stochastic shortest path (SSP) setting (Cohen et al., 2020; Tarbouriech et al., 2020), we derive an algorithm that requires $\tilde{O}( B D + D^{3/2} S^2 A)$ time steps to collect the $B = \sum_{s,a} b(s,a)$ desired samples, in any unknown and communicating MDP with $S$ states, $A$ actions and diameter $D$. Leveraging the generality of our strategy, we readily apply it to a variety of existing settings (e.g., model estimation, pure exploration in MDPs) for which we obtain improved sample-complexity guarantees, and to a set of new problems such as best-state identification and sparse reward discovery.


Efficient Optimistic Exploration in Linear-Quadratic Regulators via Lagrangian Relaxation

arXiv.org Machine Learning

We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting. Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of \ofulq and cast it into a constrained \textit{extended} LQR problem, where an additional control variable implicitly selects the system dynamics within a confidence interval. We then move to the corresponding Lagrangian formulation for which we prove strong duality. As a result, we show that an $\epsilon$-optimistic controller can be computed efficiently by solving at most $O\big(\log(1/\epsilon)\big)$ Riccati equations. Finally, we prove that relaxing the original \ofu problem does not impact the learning performance, thus recovering the $\tilde{O}(\sqrt{T})$ regret of \ofulq. To the best of our knowledge, this is the first computationally efficient confidence-based algorithm for LQR with worst-case optimal regret guarantees.


Predicting Sequences of Traversed Nodes in Graphs using Network Models with Multiple Higher Orders

arXiv.org Machine Learning

We propose a novel sequence prediction method for sequential data capturing node traversals in graphs. Our method builds on a statistical modelling framework that combines multiple higher-order network models into a single multi-order model. We develop a technique to fit such multi-order models in empirical sequential data and to select the optimal maximum order. Our framework facilitates both next-element and full sequence prediction given a sequence-prefix of any length. We evaluate our model based on six empirical data sets containing sequences from website navigation as well as public transport systems. The results show that our method out-performs state-of-the-art algorithms for next-element prediction. We further demonstrate the accuracy of our method during out-of-sample sequence prediction and validate that our method can scale to data sets with millions of sequences.


Efficient Planning in Large MDPs with Weak Linear Function Approximation

arXiv.org Machine Learning

Large-scale Markov decision processes (MDPs) require planning algorithms with runtime independent of the number of states of the MDP. We consider the planning problem in MDPs using linear value function approximation with only weak requirements: low approximation error for the optimal value function, and a small set of "core" states whose features span those of other states. In particular, we make no assumptions about the representability of policies or value functions of non-optimal policies. Our algorithm produces almost-optimal actions for any state using a generative oracle (simulator) for the MDP, while its computation time scales polynomially with the number of features, core states, and actions and the effective horizon.


zx

#artificialintelligence

Edit: If you want to see MarkovComposer in action, but you don't want to mess with Java code, you can access a web version of it here. In the following article, I'll present some of the research I've been working on lately. Algorithms, or algorithmic composition, have been used to compose music for centuries. For example, Western punctus contra punctum can be sometimes reduced to algorithmic determinacy. Then, why not use fast-learning computers capable of billions of calculations per second to do what they do best, to follow algorithms?


Adversarial jamming attacks and defense strategies via adaptive deep reinforcement learning

arXiv.org Artificial Intelligence

As the applications of deep reinforcement learning (DRL) in wireless communications grow, sensitivity of DRL based wireless communication strategies against adversarial attacks has started to draw increasing attention. In order to address such sensitivity and alleviate the resulting security concerns, we in this paper consider a victim user that performs DRL-based dynamic channel access, and an attacker that executes DRLbased jamming attacks to disrupt the victim. Hence, both the victim and attacker are DRL agents and can interact with each other, retrain their models, and adapt to opponents' policies. In this setting, we initially develop an adversarial jamming attack policy that aims at minimizing the accuracy of victim's decision making on dynamic channel access. Subsequently, we devise defense strategies against such an attacker, and propose three defense strategies, namely diversified defense with proportional-integral-derivative (PID) control, diversified defense with an imitation attacker, and defense via orthogonal policies. We design these strategies to maximize the attacked victim's accuracy and evaluate their performances.


Learning Abstract Models for Strategic Exploration and Fast Reward Transfer

arXiv.org Artificial Intelligence

Model-based reinforcement learning (RL) is appealing because (i) it enables planning and thus more strategic exploration, and (ii) by decoupling dynamics from rewards, it enables fast transfer to new reward functions. However, learning an accurate Markov Decision Process (MDP) over high-dimensional states (e.g., raw pixels) is extremely challenging because it requires function approximation, which leads to compounding errors. Instead, to avoid compounding errors, we propose learning an abstract MDP over abstract states: low-dimensional coarse representations of the state (e.g., capturing agent position, ignoring other objects). We assume access to an abstraction function that maps the concrete states to abstract states. In our approach, we construct an abstract MDP, which grows through strategic exploration via planning. Similar to hierarchical RL approaches, the abstract actions of the abstract MDP are backed by learned subpolicies that navigate between abstract states. Our approach achieves strong results on three of the hardest Arcade Learning Environment games (Montezuma's Revenge, Pitfall!, and Private Eye), including superhuman performance on Pitfall! without demonstrations. After training on one task, we can reuse the learned abstract MDP for new reward functions, achieving higher reward in 1000x fewer samples than model-free methods trained from scratch.


A Survey on Autonomous Vehicle Control in the Era of Mixed-Autonomy: From Physics-Based to AI-Guided Driving Policy Learning

arXiv.org Artificial Intelligence

This paper serves as an introduction and overview of the potentially useful models and methodologies from artificial intelligence (AI) into the field of transportation engineering for autonomous vehicle (AV) control in the era of mixed autonomy. We will discuss state-of-the-art applications of AI-guided methods, identify opportunities and obstacles, raise open questions, and help suggest the building blocks and areas where AI could play a role in mixed autonomy. We divide the stage of autonomous vehicle (AV) deployment into four phases: the pure HVs, the HV-dominated, the AVdominated, and the pure AVs. This paper is primarily focused on the latter three phases. It is the first-of-its-kind survey paper to comprehensively review literature in both transportation engineering and AI for mixed traffic modeling. Models used for each phase are summarized, encompassing game theory, deep (reinforcement) learning, and imitation learning. While reviewing the methodologies, we primarily focus on the following research questions: (1) What scalable driving policies are to control a large number of AVs in mixed traffic comprised of human drivers and uncontrollable AVs? (2) How do we estimate human driver behaviors? (3) How should the driving behavior of uncontrollable AVs be modeled in the environment? (4) How are the interactions between human drivers and autonomous vehicles characterized? Hopefully this paper will not only inspire our transportation community to rethink the conventional models that are developed in the data-shortage era, but also reach out to other disciplines, in particular robotics and machine learning, to join forces towards creating a safe and efficient mixed traffic ecosystem.


Improved Analysis of UCRL2 with Empirical Bernstein Inequality

arXiv.org Machine Learning

We consider the problem of exploration-exploitation in communicating Markov Decision Processes. We provide an analysis of UCRL2 with Empirical Bernstein inequalities (UCRL2B). For any MDP with $S$ states, $A$ actions, $\Gamma \leq S$ next states and diameter $D$, the regret of UCRL2B is bounded as $\widetilde{O}(\sqrt{D\Gamma S A T})$.