finite time analysis
Review for NeurIPS paper: Decentralized TD Tracking with Linear Function Approximation and its Finite-Time Analysis
This paper is a theoretical work that provides a finite-time analysis of decentralized TD learning. The reviewers and I, though not unanimously, think this contribution may be significant and interesting to the community, given the recent interest in finite-time analysis of TD algorithms with (linear) function approximation. We ask the authors to address the required changes in the manuscript. The authors propose a decentralized TD tracking method with linear function approximation. The reviewers and I were not convinced that the paper proposes a novel method, mainly because of the lack of a proper comparison with previous work.
Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP
Sangadi, Tejaram, Prashanth, L. A., Jagannathan, Krishna
In the standard reinforcement learning (RL) setting, the objective is to learn a policy that maximizes the value function, which is the expectation of the cumulative reward obtained over a finite or infinite time horizon. However, in several practical scenarios, including finance, automated driving and drug testing, a risk-sensitive learning paradigm assumes importance, wherein the value function, which is an expectation, needs to be traded off suitably against an appropriate risk metric associated with the reward distribution. One way to achieve this is to solve a constrained optimization problem with this risk metric as a constraint and the value function as the objective. Variance is a popular risk measure, which is usually incorporated into a risk-sensitive optimization problem as a constraint, with the usual expected value as the objective. Such a mean-variance formulation was studied in the seminal work of Markowitz [10]. In the context of RL, mean-variance optimization has been considered in several previous works.
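As a rough illustration of the constrained formulation sketched above (the notation, including the variance budget $\alpha$, is an assumption made here for the example and is not taken from the paper), the mean-variance problem can be written as
$$\max_{\pi} \ \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t} r_t\Big] \quad \text{subject to} \quad \mathrm{Var}\Big(\sum_{t \ge 0} \gamma^{t} r_t\Big) \le \alpha,$$
where $\gamma \in (0,1)$ is the discount factor and the expectation and variance are taken over trajectories generated by the policy $\pi$.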
Finite Time Analysis of Constrained Actor Critic and Constrained Natural Actor Critic Algorithms
Panda, Prashansa, Bhatnagar, Shalabh
Actor Critic methods have found immense applications in a wide range of Reinforcement Learning tasks, especially when the state-action space is large. In this paper, we consider actor critic and natural actor critic algorithms with function approximation for constrained Markov decision processes (C-MDP) involving inequality constraints and carry out a non-asymptotic analysis for both of these algorithms in a non-i.i.d. (Markovian) setting. We consider the long-run average cost criterion where both the objective and the constraint functions are suitable policy-dependent long-run averages of certain prescribed cost functions. We handle the inequality constraints using the Lagrange multiplier method. We prove that these algorithms are guaranteed to find a first-order stationary point (i.e., $\Vert \nabla L(\theta,\gamma)\Vert_2^2 \leq \epsilon$) of the performance (Lagrange) function $L(\theta,\gamma)$, with a sample complexity of $\mathcal{\tilde{O}}(\epsilon^{-2.5})$ in the case of both the Constrained Actor Critic (C-AC) and Constrained Natural Actor Critic (C-NAC) algorithms. We also show the results of experiments on a few different grid world settings and observe good empirical performance using both of these algorithms. In particular, for large grid sizes, Constrained Natural Actor Critic shows slightly better results than Constrained Actor Critic, while the latter is slightly better for a small grid size.
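As a hedged sketch of the Lagrangian relaxation mentioned in the abstract (the specific objective $J(\theta)$, constraint function $J_c(\theta)$, and threshold $\alpha$ below are assumptions for illustration and may differ from the paper's exact formulation), the constrained problem is folded into a single performance function
$$L(\theta, \gamma) = J(\theta) + \gamma\,\big(J_c(\theta) - \alpha\big), \qquad \gamma \ge 0,$$
where $\theta$ parameterizes the policy and $\gamma$ is the Lagrange multiplier; the quoted guarantee is that both algorithms reach a point with $\Vert \nabla L(\theta,\gamma)\Vert_2^2 \le \epsilon$ within $\mathcal{\tilde{O}}(\epsilon^{-2.5})$ samples.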
A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation
Bhandari, Jalaj, Russo, Daniel, Singal, Raghav
Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its theoretical analysis has proved challenging and few guarantees on its statistical efficiency are available. In this work, we provide a simple and explicit finite time analysis of temporal difference learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. A final section of the paper shows that all of our main results extend to the study of Q-learning applied to high-dimensional optimal stopping problems.
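As a minimal illustrative sketch of the algorithm analysed here, the following shows TD(0) with linear value-function approximation in Python; the data layout, feature map, step size, and function name are assumptions made for the example rather than the paper's setup.

import numpy as np

def td0_linear(features, next_features, rewards, gamma=0.99, alpha=0.05):
    # features[t], next_features[t]: feature vectors phi(s_t) and phi(s_{t+1})
    # rewards[t]: observed one-step reward r_t
    # Returns theta such that V(s) is approximated by phi(s) . theta.
    theta = np.zeros(features.shape[1])
    for phi, phi_next, r in zip(features, next_features, rewards):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = r + gamma * phi_next.dot(theta) - phi.dot(theta)
        # Semi-gradient TD(0) update along the current feature direction
        theta = theta + alpha * delta * phi
    return theta

The finite-time bounds in the paper characterize how quickly such iterates approach the TD fixed point, under conditions on the step size and on how transitions are sampled; the constant step size used above is only one of the regimes typically considered.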