Szepesvari, Csaba
Bayesian Optimal Control of Smoothly Parameterized Systems: The Lazy Posterior Sampling Algorithm
Abbasi-Yadkori, Yasin, Szepesvari, Csaba
We study Bayesian optimal control of a general class of smoothly parameterized Markov decision problems. Since computing the optimal control is computationally expensive, we design an algorithm that trades off performance for computational efficiency. The algorithm is a lazy posterior sampling method that maintains a distribution over the unknown parameter. The algorithm changes its policy only when the variance of the distribution is reduced sufficiently. Importantly, we analyze the algorithm and show the precise nature of the performance vs. computation tradeoff. Finally, we show the effectiveness of the method on a web server control application.
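To make the laziness concrete, here is a minimal numpy sketch of the update rule on a toy scalar system; the Gaussian observation model, the variance-halving threshold, and the plan() stand-in are illustrative assumptions, not the paper's construction.

    import numpy as np

    rng = np.random.default_rng(0)
    theta_true = 0.7                    # unknown system parameter (toy model)
    mu, var = 0.0, 1.0                  # Gaussian prior over theta
    noise_var = 0.25                    # observation model: y = theta + noise

    def plan(theta_sample):
        # Stand-in for solving the control problem for a sampled parameter.
        return lambda: theta_sample

    policy = plan(rng.normal(mu, np.sqrt(var)))
    var_at_last_switch = var

    for t in range(200):
        y = theta_true + rng.normal(0.0, np.sqrt(noise_var))
        prec = 1.0 / var + 1.0 / noise_var      # conjugate Gaussian update
        mu = (mu / var + y / noise_var) / prec
        var = 1.0 / prec
        # Lazy rule: re-plan only once the posterior variance has halved.
        if var < 0.5 * var_at_last_switch:
            policy = plan(rng.normal(mu, np.sqrt(var)))
            var_at_last_switch = var

    print(f"posterior mean {mu:.3f}, current policy parameter {policy():.3f}")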
Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
Abbasi, Yasin, Bartlett, Peter L., Kanade, Varun, Seldin, Yevgeny, Szepesvari, Csaba
We study the problem of online learning in Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves $O(\sqrt{T\log|\Pi|}+\log|\Pi|)$ regret with respect to a comparison set of policies $\Pi$. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set $\Pi$ has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start node and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start node to the finish node. At the end of each episode, the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. For randomly chosen graphs and adversarial losses, this problem can be solved efficiently. We show that it can also be solved efficiently for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. Finally, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes.
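For intuition, the $O(\sqrt{T\log|\Pi|})$ dependence is the signature of exponential weighting over a finite comparison class. The numpy sketch below shows that generic Hedge-style update, not the paper's algorithm (which must additionally handle the Markovian dynamics and the mixing assumption); the per-policy loss oracle expected_loss is an assumed helper.

    import numpy as np

    def hedge(expected_loss, n_policies, T, rng):
        """expected_loss(i, t) -> loss in [0, 1] of policy i at round t."""
        eta = np.sqrt(8.0 * np.log(n_policies) / T)   # standard Hedge rate
        log_w = np.zeros(n_policies)
        total = 0.0
        for t in range(T):
            p = np.exp(log_w - log_w.max())
            p /= p.sum()
            i = rng.choice(n_policies, p=p)           # policy played this round
            losses = np.array([expected_loss(j, t) for j in range(n_policies)])
            total += losses[i]
            log_w -= eta * losses                     # full-information update
        return total

    rng = np.random.default_rng(1)
    # Toy losses: policy 2 is best on average, the rest are worse.
    loss = lambda i, t: np.clip(rng.normal(0.2 if i == 2 else 0.5, 0.1), 0, 1)
    print(hedge(loss, n_policies=5, T=1000, rng=rng))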
Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
Abbasi-Yadkori, Yasin, Bartlett, Peter L., Szepesvari, Csaba
We study the problem of learning Markov decision processes with finite state and action spaces when the transition probability distributions and loss functions are chosen adversarially and are allowed to change with time. We introduce an algorithm whose regret with respect to any policy in a comparison class grows as the square root of the number of rounds of the game, provided the transition probabilities satisfy a uniform mixing condition. Our approach is efficient as long as the comparison class has polynomial size and we can compute expectations over sample paths for each policy. Designing an efficient algorithm with small regret for the general case remains an open problem.
Approximate Policy Iteration with Linear Action Models
Yao, Hengshuai (University of Alberta), Szepesvari, Csaba (University of Alberta)
In this paper we consider the problem of finding a good policy given some batch data. We propose a new approach, LAM-API, that first builds a so-called linear action model (LAM) from the data and then uses the learned model and the collected data in approximate policy iteration (API) to find a good policy. A natural choice for the policy evaluation step in this algorithm is the least-squares temporal difference (LSTD) learning algorithm. Empirical results on three benchmark problems show that this particular instance of LAM-API performs competitively compared with LSPI, in terms of both data and computational efficiency.
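A minimal numpy sketch of the two ingredients follows, under assumed shapes and names (phi is a d-dimensional state feature vector; the value weights w would come from the LSTD evaluation step): per-action least-squares models of the feature dynamics and reward, and one-step lookahead through those models for policy improvement.

    import numpy as np

    def fit_lam(transitions, n_actions, d, ridge=1e-3):
        """transitions: iterable of (phi, a, r, phi_next); assumes every
        action occurs at least once in the batch."""
        F, f = [], []
        for a in range(n_actions):
            rows = [(p, r, pn) for (p, aa, r, pn) in transitions if aa == a]
            Phi = np.array([p for p, _, _ in rows])
            R = np.array([r for _, r, _ in rows])
            PhiN = np.array([pn for _, _, pn in rows])
            G = Phi.T @ Phi + ridge * np.eye(d)           # ridge for stability
            F.append(np.linalg.solve(G, Phi.T @ PhiN).T)  # phi' ~ F[a] @ phi
            f.append(np.linalg.solve(G, Phi.T @ R))       # r    ~ f[a] @ phi
        return F, f

    def greedy_action(phi, F, f, w, gamma=0.99):
        # One-step lookahead through the model: r_hat + gamma * V_hat(phi').
        q = [f[a] @ phi + gamma * w @ (F[a] @ phi) for a in range(len(F))]
        return int(np.argmax(q))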
Statistical Linear Estimation with Penalized Estimators: an Application to Reinforcement Learning
Pires, Bernardo Avila, Szepesvari, Csaba
Motivated by value function estimation in reinforcement learning, we study statistical linear inverse problems, i.e., problems where the coefficients of a linear system to be solved are observed in noise. We consider penalized estimators, where performance is evaluated using a matrix-weighted two-norm of the defect of the estimator measured with respect to the true, unknown coefficients. Two objective functions are considered, depending on whether the error of the defect measured with respect to the noisy coefficients is squared or unsquared. We propose simple, yet novel and theoretically well-founded, data-dependent choices for the regularization parameters in both cases that avoid data splitting. A distinguishing feature of our analysis is that we derive deterministic error bounds in terms of the error of the coefficients, thus completely separating out the analysis of the stochastic properties of these errors. We show that our results lead to new insights and bounds for linear value function estimation in reinforcement learning.
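For concreteness, a generic squared-error instance of such a penalized estimator is $\hat{x}_\lambda \in \arg\min_x \|\hat{A}x - \hat{b}\|_2^2 + \lambda\,\mathrm{pen}(x)$, where $(\hat{A},\hat{b})$ are the noisy observations of the true coefficients $(A,b)$, and the defect is then measured as $\|A\hat{x}_\lambda - b\|_W$ for a weighting matrix $W$; this is a simplified illustration of the setup, not the paper's exact objectives.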
An Adaptive Algorithm for Finite Stochastic Partial Monitoring
Bartok, Gabor, Zolghadr, Navid, Szepesvari, Csaba
We present a new anytime algorithm that achieves near-optimal regret for any instance of finite stochastic partial monitoring. In particular, the new algorithm achieves the minimax regret, within logarithmic factors, for both "easy" and "hard" problems. For easy problems, it additionally achieves logarithmic individual regret. Most importantly, the algorithm is adaptive in the sense that if the opponent strategy lies in an "easy region" of the strategy space, then the regret grows as if the problem were easy. As an implication, we show that under some reasonable additional assumptions, the algorithm enjoys $O(\sqrt{T})$ regret in Dynamic Pricing, a problem proven to be hard by Bartok et al. (2011).
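For readers unfamiliar with the setting, the sketch below spells out the finite stochastic partial monitoring protocol itself, not the adaptive algorithm; the loss matrix L, feedback matrix H, and outcome distribution form a toy instance chosen for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    L = np.array([[0.0, 1.0],        # loss matrix: L[action, outcome]
                  [1.0, 0.0]])
    H = np.array([["a", "b"],        # feedback matrix: H[action, outcome];
                  ["c", "c"]])       # action 1 reveals nothing about the outcome
    p_outcome = np.array([0.3, 0.7]) # opponent's (unknown) outcome distribution

    total_loss = 0.0
    for t in range(1000):
        action = rng.integers(2)             # placeholder learner; the real
        outcome = rng.choice(2, p=p_outcome) # algorithm adapts to the feedback
        total_loss += L[action, outcome]     # suffered but never observed
        feedback = H[action, outcome]        # the only signal the learner sees
    print(f"total loss of the uniform learner: {total_loss:.0f}")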
Apprenticeship Learning using Inverse Reinforcement Learning and Gradient Methods
Neu, Gergely, Szepesvari, Csaba
In this paper we propose a novel gradient algorithm to learn a policy from an expert's observed behavior, assuming that the expert behaves optimally with respect to some unknown reward function of a Markovian Decision Problem. The algorithm's aim is to find a reward function such that the resulting optimal policy matches the expert's observed behavior well. The main difficulty is that the mapping from the parameters to policies is both nonsmooth and highly redundant. Resorting to subdifferentials solves the first difficulty, while the second one is overcome by computing natural gradients. We tested the proposed method in two artificial domains and found it to be more reliable and efficient than some previous methods.
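The loop below is a toy numpy rendering of this idea on a two-state MDP: the reward is $r_\theta(s) = \theta^\top \phi(s)$, a Boltzmann (softmax) policy smooths the nonsmooth parameter-to-policy map, and finite differences stand in for the paper's subdifferential and natural-gradient machinery; the MDP, features, and step sizes are all illustrative assumptions.

    import numpy as np

    P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[a, s, s']: action 0 is "sticky"
                  [[0.5, 0.5], [0.5, 0.5]]])  # action 1 moves uniformly
    phi, gamma, tau = np.eye(2), 0.9, 0.1     # one feature per state

    def boltzmann_policy(theta):
        r, V = phi @ theta, np.zeros(2)
        for _ in range(200):                  # soft value iteration
            Q = r[:, None] + gamma * np.einsum('ast,t->sa', P, V)
            m = Q.max(axis=1)
            V = m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))
        pi = np.exp((Q - V[:, None]) / tau)
        return pi / pi.sum(axis=1, keepdims=True)

    def feature_expectations(pi):
        Ppi = np.einsum('sa,ast->st', pi, P)  # policy-induced transition matrix
        d = np.linalg.solve(np.eye(2) - gamma * Ppi.T, np.array([0.5, 0.5]))
        return phi.T @ d                      # discounted state-feature counts

    mu_E = feature_expectations(np.array([[1.0, 0.0], [1.0, 0.0]]))  # "expert"

    def J(theta):                             # mismatch to the expert's behavior
        diff = feature_expectations(boltzmann_policy(theta)) - mu_E
        return diff @ diff

    theta, eps, lr = np.array([0.1, -0.1]), 1e-3, 0.5
    for _ in range(50):                       # finite-difference gradient descent
        g = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
                      for e in np.eye(2)])
        theta -= lr * g
    print(theta, J(theta))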
Analysis of Kernel Mean Matching under Covariate Shift
Yu, Yaoliang, Szepesvari, Csaba
In real-world supervised learning scenarios, it is not uncommon for the training and test samples to follow different probability distributions, making it necessary to correct the resulting sampling bias. Focusing on a particular covariate shift problem, we derive high-probability confidence bounds for the kernel mean matching (KMM) estimator, whose convergence rate turns out to depend on some regularity measure of the regression function and also on some capacity measure of the kernel. By comparing KMM with the natural plug-in estimator, we establish the superiority of the former and hence provide concrete evidence for the effectiveness of KMM under covariate shift.
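The optimization behind KMM is compact enough to sketch: choose weights $\beta$ on the training points so that the $\beta$-weighted kernel mean of the training sample matches the kernel mean of the test sample. Below is a hedged numpy illustration in which projected gradient descent on a box constraint replaces the usual QP solver, and the constraint keeping the mean of $\beta$ near 1 is omitted for brevity; the kernel, bandwidth, and step sizes are illustrative choices.

    import numpy as np

    def rbf(A, B, sigma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def kmm_weights(X_tr, X_te, B=10.0, steps=2000, lr=5.0):
        n = len(X_tr)
        K = rbf(X_tr, X_tr)                     # (n, n) training Gram matrix
        kappa = rbf(X_tr, X_te).mean(axis=1)    # (n,) cross kernel means
        beta = np.ones(n)
        for _ in range(steps):
            # gradient of 0.5 * (1/n^2) beta' K beta - (1/n) kappa' beta
            g = (K @ beta) / n**2 - kappa / n
            beta = np.clip(beta - lr * g, 0.0, B)
        return beta

    rng = np.random.default_rng(3)
    X_tr = rng.normal(0.0, 1.0, size=(200, 1))   # training sample: N(0, 1)
    X_te = rng.normal(0.5, 1.0, size=(200, 1))   # test sample: shifted mean
    beta = kmm_weights(X_tr, X_te)
    # Weights should upweight training points that look like test points.
    print(beta[X_tr[:, 0] > 0.5].mean(), beta[X_tr[:, 0] < -0.5].mean())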
Speeding Up Planning in Markov Decision Processes via Automatically Constructed Abstractions
Isaza, Alejandro, Szepesvari, Csaba, Bulitko, Vadim, Greiner, Russell
In this paper, we consider planning in stochastic shortest path (SSP) problems, a subclass of Markov Decision Problems (MDPs). We focus on medium-size problems whose state space can be fully enumerated. This problem has numerous important applications, such as navigation and planning under uncertainty. We propose a new approach for constructing a multi-level hierarchy of progressively simpler abstractions of the original problem. Once computed, the hierarchy can be used to speed up planning by first finding a policy for the most abstract level and then recursively refining it into a solution to the original problem. This approach is fully automated and delivers a speed-up of two orders of magnitude over a state-of-the-art MDP solver on sample problems while returning near-optimal solutions. We also prove theoretical bounds on the loss of solution optimality resulting from the use of abstractions.
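A structural sketch of the top-down use of such a hierarchy follows; solve and refine are assumed callables, and the paper's actual construction of the abstractions and of the refinement step is more involved.

    def solve_with_hierarchy(levels, solve, refine):
        """levels[0] is the most abstract problem, levels[-1] the original SSP;
        solve(problem, guide) plans (fast when guided), and
        refine(policy, finer) projects an abstract policy down one level."""
        policy = solve(levels[0], guide=None)      # cheap: tiny abstract problem
        for finer in levels[1:]:
            guide = refine(policy, finer)          # restrict the search below
            policy = solve(finer, guide=guide)     # guided, hence faster
        return policy                              # policy for the original MDP

    # Tiny stand-in demo of the control flow:
    levels = ["L2-abstract", "L1-abstract", "original"]
    solve = lambda p, guide: f"policy({p}|{guide})"
    refine = lambda pol, finer: f"guide-from({pol})"
    print(solve_with_hierarchy(levels, solve, refine))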
PAC-Bayesian Policy Evaluation for Reinforcement Learning
Fard, Mahdi Milani, Pineau, Joelle, Szepesvari, Csaba
Bayesian priors offer a compact yet general means of incorporating domain knowledge into many learning tasks. The correctness of Bayesian analysis and inference, however, largely depends on the accuracy and correctness of these priors. PAC-Bayesian methods overcome this problem by providing bounds that hold regardless of the correctness of the prior distribution. This paper introduces the first PAC-Bayesian bound for the batch reinforcement learning problem with function approximation. We show how this bound can be used to perform model selection in a transfer learning scenario. Our empirical results confirm that PAC-Bayesian policy evaluation is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, to ignore them when they are misleading.
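For orientation, bounds in this family typically take the classic PAC-Bayesian form (stated here in its standard supervised-learning variant, not the paper's RL bound): with probability at least $1-\delta$ over a sample of size $n$, simultaneously for every posterior $\rho$, $\mathbb{E}_{h\sim\rho}[L(h)] \le \mathbb{E}_{h\sim\rho}[\hat{L}(h)] + \sqrt{(\mathrm{KL}(\rho\|\pi) + \ln(2\sqrt{n}/\delta))/(2n)}$, where $\pi$ is the prior. The KL term is exactly what lets an informative prior tighten the bound while a misleading one merely loosens it, matching the behavior reported above.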