Goto

Collaborating Authors

 finite sample analysis


Finite Sample Analysis Of Dynamic Regression Parameter Learning

Neural Information Processing Systems

We consider the dynamic linear regression problem, where the predictor vector may vary with time. This problem can be modeled as a linear dynamical system, with non-constant observation operator, where the parameters that need to be learned are the variance of both the process noise and the observation noise. While variance estimation for dynamic regression is a natural problem, with a variety of applications, existing approaches to this problem either lack guarantees altogether, or only have asymptotic guarantees without explicit rates. In particular, existing literature does not provide any clues to the following fundamental question: In terms of data characteristics, what does the convergence rate depend on? In this paper we study the global system operator -- the operator that maps the noise vectors to the output. We obtain estimates on its spectrum, and as a result derive the first known variance estimators with finite sample complexity guarantees. The proposed bounds depend on the shape of a certain spectrum related to the system operator, and thus provide the first known explicit geometric parameter of the data that can be used to bound estimation errors. In addition, the results hold for arbitrary sub Gaussian distributions of noise terms. We evaluate the approach on synthetic and real-world benchmarks.


Finite Sample Analysis of Average-Reward TD Learning and Q -Learning

Neural Information Processing Systems

The focus of this paper is on sample complexity guarantees of average-reward reinforcement learning algorithms, which are known to be more challenging to study than their discounted-reward counterparts. To the best of our knowledge, we provide the first known finite sample guarantees using both constant and diminishing step sizes of (i) average-reward TD($\lambda$) with linear function approximation for policy evaluation and (ii) average-reward $Q$-learning in the tabular setting to find the optimal policy. A major challenge is that since the value functions are agnostic to an additive constant, the corresponding Bellman operators are no longer contraction mappings under any norm. We obtain the results for TD($\lambda$) by working in an appropriately defined subspace that ensures uniqueness of the solution. For $Q$-learning, we exploit the span seminorm contractive property of the Bellman operator, and construct a novel Lyapunov function obtained by infimal convolution of a generalized Moreau envelope and the indicator function of a set.


Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting

Neural Information Processing Systems

In reinforcement learning (RL), one of the key components is policy evaluation, which aims to estimate the value function (i.e., expected long-term accumulated reward) of a policy. With a good policy evaluation method, the RL algorithms will estimate the value function more accurately and find a better policy. When the state space is large or continuous \emph{Gradient-based Temporal Difference(GTD)} policy evaluation algorithms with linear function approximation are widely used. Considering that the collection of the evaluation data is both time and reward consuming, a clear understanding of the finite sample performance of the policy evaluation algorithms is very important to reinforcement learning. Under the assumption that data are i.i.d.


Finite Sample Analysis Of Dynamic Regression Parameter Learning

Neural Information Processing Systems

We consider the dynamic linear regression problem, where the predictor vector may vary with time. This problem can be modeled as a linear dynamical system, with non-constant observation operator, where the parameters that need to be learned are the variance of both the process noise and the observation noise. While variance estimation for dynamic regression is a natural problem, with a variety of applications, existing approaches to this problem either lack guarantees altogether, or only have asymptotic guarantees without explicit rates. In particular, existing literature does not provide any clues to the following fundamental question: In terms of data characteristics, what does the convergence rate depend on? In this paper we study the global system operator -- the operator that maps the noise vectors to the output.


Logarithmic Neyman Regret for Adaptive Estimation of the Average Treatment Effect

Neopane, Ojash, Ramdas, Aaditya, Singh, Aarti

arXiv.org Machine Learning

Estimation of the Average Treatment Effect (ATE) is a core problem in causal inference with strong connections to Off-Policy Evaluation in Reinforcement Learning. This paper considers the problem of adaptively selecting the treatment allocation probability in order to improve estimation of the ATE. The majority of prior work on adaptive ATE estimation focus on asymptotic guarantees, and in turn overlooks important practical considerations such as the difficulty of learning the optimal treatment allocation as well as hyper-parameter selection. Existing non-asymptotic methods are limited by poor empirical performance and exponential scaling of the Neyman regret with respect to problem parameters. In order to address these gaps, we propose and analyze the Clipped Second Moment Tracking (ClipSMT) algorithm, a variant of an existing algorithm with strong asymptotic optimality guarantees, and provide finite sample bounds on its Neyman regret. Our analysis shows that ClipSMT achieves exponential improvements in Neyman regret on two fronts: improving the dependence on $T$ from $O(\sqrt{T})$ to $O(\log T)$, as well as reducing the exponential dependence on problem parameters to a polynomial dependence. Finally, we conclude with simulations which show the marked improvement of ClipSMT over existing approaches.


Finite Sample Analysis of Average-Reward TD Learning and Q -Learning

Neural Information Processing Systems

The focus of this paper is on sample complexity guarantees of average-reward reinforcement learning algorithms, which are known to be more challenging to study than their discounted-reward counterparts. To the best of our knowledge, we provide the first known finite sample guarantees using both constant and diminishing step sizes of (i) average-reward TD( \lambda) with linear function approximation for policy evaluation and (ii) average-reward Q -learning in the tabular setting to find the optimal policy. A major challenge is that since the value functions are agnostic to an additive constant, the corresponding Bellman operators are no longer contraction mappings under any norm. We obtain the results for TD( \lambda) by working in an appropriately defined subspace that ensures uniqueness of the solution. For Q -learning, we exploit the span seminorm contractive property of the Bellman operator, and construct a novel Lyapunov function obtained by infimal convolution of a generalized Moreau envelope and the indicator function of a set.


Reviews: Finite sample analysis of the GTD Policy Evaluation Algorithms in Markov Setting

Neural Information Processing Systems

It is well known that the standard TD algorithm widely used in reinforcement learning does not correspond to the gradient of any objective function, and consequently is unstable when combined with any type of function approximation. Despite the success of methods like deep RL, which combines vanilla TD with deep learning, theoretically TD with nonlinear function approximation is demonstrably unstable. Much work on fixing this fundamental flaw in RL has been in vain, till the work on gradient TD methods by Sutton et al. Unfortunately, these methods work, but their analysis was flawed, based on a heuristic derivation of the method. A recent breakthrough by Liu et al. (UAI 2015) showed that gradient TD methods are essentially saddle point methods that are pure gradient methods that optimize not the original gradient TD loss function (which they do not), but rather the saddle point loss function that arises when converting the original loss function into the dual space.


Reviews: A Block Coordinate Ascent Algorithm for Mean-Variance Optimization

Neural Information Processing Systems

This work is deals with risk measure in RL/Planning, specifically, risk measures that are based on trajectory variance. Unlike the straight forward approach that is taken in previous works, such as Tamar et al. 2012, 2014 or 2; Prashanth and Ghavamzadeh, 2013, in this work multiple time scale stochastic approximation (MTSSA) is not used. The authors argue that MTSSA is hard to tune and has a slow convergence rate. Instead, the authors propose an approach which is use Block Coordinate Descent. This approach is based on coordinate descent where during the optimization process, not all the coordinates of the policy parameter are optimized, but only a subset of them.



Direct Gradient Temporal Difference Learning

Qian, Xiaochi, Zhang, Shangtong

arXiv.org Artificial Intelligence

Off-policy learning enables a reinforcement learning (RL) agent to reason counterfactually about policies that are not executed and is one of the most important ideas in RL. It, however, can lead to instability when combined with function approximation and bootstrapping, two arguably indispensable ingredients for large-scale reinforcement learning. This is the notorious deadly triad. Gradient Temporal Difference (GTD) is one powerful tool to solve the deadly triad. Its success results from solving a doubling sampling issue indirectly with weight duplication or Fenchel duality. In this paper, we instead propose a direct method to solve the double sampling issue by simply using two samples in a Markovian data stream with an increasing gap. The resulting algorithm is as computationally efficient as GTD but gets rid of GTD's extra weights. The only price we pay is a logarithmically increasing memory as time progresses. We provide both asymptotic and finite sample analysis, where the convergence rate is on-par with the canonical on-policy temporal difference learning. Key to our analysis is a novel refined discretization of limiting ODEs.