Olshevsky, Alex
Geometric Re-Analysis of Classical MDP Solving Algorithms
Mustafin, Arsenii, Pakharev, Aleksei, Olshevsky, Alex, Paschalidis, Ioannis Ch.
We extend a recently introduced geometric interpretation of Markov Decision Processes (MDPs) that provides a new perspective on MDP algorithms and their dynamics. Based on this view, we develop a novel analytical framework that simplifies the proofs of existing results and enables us to derive new ones. Specifically, we analyze the behavior of two classical MDP-solving algorithms: Policy Iteration (PI) and Value Iteration (VI). For each algorithm, we first describe its dynamics in geometric terms and then present an analysis along with several convergence results. We begin by introducing an MDP transformation that modifies the discount factor $\gamma$ and demonstrating how this transformation improves the convergence properties of both algorithms, provided that it can be applied so that the resulting system remains a regular MDP. Second, we present a new analysis of PI in the 2-state MDP case, showing that the number of iterations required for convergence is bounded by the number of state-action pairs. Finally, we reveal an additional convergence factor in the VI algorithm for cases with a connected optimal policy, attributed to an extra rotation component in the VI dynamics.
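To connect the geometric description to the algorithms' standard forms, the following is a minimal sketch of Value Iteration and Policy Iteration on a small tabular MDP. The 2-state, 2-action instance, transition tensor P[a, s, s'] and reward matrix R[s, a] are illustrative placeholders of my own, not an example from the paper, and the code implements only the classical algorithms rather than the geometric re-analysis.

```python
import numpy as np

# Minimal sketch of Value Iteration and Policy Iteration on a toy tabular MDP.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # action 0: P[a, s, s']
              [[0.1, 0.9], [0.6, 0.4]]])   # action 1
R = np.array([[1.0, 0.0],                  # rewards R[s, a]
              [0.5, 2.0]])
n_states, n_actions = R.shape

def value_iteration(tol=1e-8):
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

def policy_iteration():
    pi = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[pi, np.arange(n_states), :]
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Greedy policy improvement.
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new

print(value_iteration())
print(policy_iteration())
```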
Sample Complexity of Linear Quadratic Regulator Without Initial Stability
Moghaddam, Amirreza Neshaei, Olshevsky, Alex, Gharesifard, Bahman
Inspired by REINFORCE, we introduce a novel receding-horizon algorithm for the Linear Quadratic Regulator (LQR) problem with unknown parameters. Unlike prior methods, our algorithm avoids reliance on two-point gradient estimates while maintaining the same order of sample complexity. Furthermore, it eliminates the restrictive requirement of starting with a stable initial policy, broadening its applicability. Beyond these improvements, we introduce a refined analysis of error propagation through the contraction of the Riemannian distance over the Riccati operator. This refinement leads to a better sample complexity and ensures improved convergence guarantees.
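The following is a minimal, self-contained sketch of the two objects named in the last sentences: the discrete-time Riccati operator and the affine-invariant Riemannian distance on positive definite matrices, under which the Riccati map is known to contract under suitable conditions. The system matrices and starting point are arbitrary placeholders; the code only illustrates the distance numerically and does not reproduce the paper's receding-horizon algorithm.

```python
import numpy as np
from scipy.linalg import eigvalsh, solve_discrete_are

# Placeholder system matrices for illustration only.
rng = np.random.default_rng(0)
n, m = 3, 2
A = 0.5 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
Q = np.eye(n)
R = np.eye(m)

def riccati_step(P):
    """One application of the discrete-time Riccati operator."""
    BtPB = R + B.T @ P @ B
    return Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(BtPB, B.T @ P @ A)

def riemannian_distance(X, Y):
    """Affine-invariant distance: sqrt(sum_i log^2 lambda_i(X^{-1} Y))."""
    w = eigvalsh(Y, X)          # generalized eigenvalues of (Y, X)
    return np.sqrt(np.sum(np.log(w) ** 2))

P_star = solve_discrete_are(A, B, Q, R)   # fixed point of the Riccati operator
P = np.eye(n)
for k in range(8):
    P = riccati_step(P)
    print(k, riemannian_distance(P, P_star))  # track distance to the fixed point
```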
Analysis of Value Iteration Through Absolute Probability Sequences
Mustafin, Arsenii, Colla, Sebastien, Olshevsky, Alex, Paschalidis, Ioannis Ch.
Value Iteration is a widely used algorithm for solving Markov Decision Processes (MDPs). While previous studies have extensively analyzed its convergence properties, they primarily focus on convergence with respect to the infinity norm. In this work, we use absolute probability sequences to develop a new line of analysis and examine the algorithm's convergence in terms of the $L^2$ norm, offering a new perspective on its behavior and performance.
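As a rough illustration of what an $L^2$-norm analysis measures, the sketch below runs Value Iteration for a fixed policy (i.e., policy evaluation) on a random MDP and tracks the error both in the infinity norm and in an $L^2$ norm weighted by the stationary distribution of the transition matrix. The weighting is an illustrative choice of mine and not necessarily the construction used in the paper.

```python
import numpy as np

# Toy policy-evaluation example: V <- r + gamma * P V, with error tracked
# in the infinity norm and in a stationary-distribution-weighted L2 norm.
rng = np.random.default_rng(1)
n, gamma = 5, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # row-stochastic
r = rng.random(n)

V_star = np.linalg.solve(np.eye(n) - gamma * P, r)          # exact fixed point

# Stationary distribution: left eigenvector of P for eigenvalue 1.
w, vecs = np.linalg.eig(P.T)
mu = np.real(vecs[:, np.argmin(np.abs(w - 1))]); mu /= mu.sum()

V = np.zeros(n)
for k in range(30):
    V = r + gamma * P @ V
    err_inf = np.max(np.abs(V - V_star))
    err_l2 = np.sqrt(mu @ (V - V_star) ** 2)
    print(k, err_inf, err_l2)
```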
MDP Geometry, Normalization and Value Free Solvers
Mustafin, Arsenii, Pakharev, Aleksei, Olshevsky, Alex, Paschalidis, Ioannis Ch.
The Markov Decision Process (MDP) is a common mathematical model for sequential decision-making problems. In this paper, we present a new geometric interpretation of MDPs, which is useful for analyzing the dynamics of the main MDP algorithms. Based on this interpretation, we demonstrate that MDPs can be split into equivalence classes with indistinguishable algorithm dynamics. The related normalization procedure allows for the design of a new class of MDP-solving algorithms that find optimal policies without computing policy values.
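A simple, well-known special case of equivalent MDPs with indistinguishable algorithm dynamics is a uniform reward shift: adding the same constant to every reward changes all action values by $c/(1-\gamma)$ but leaves the greedy and optimal policies untouched. The sketch below illustrates only this elementary case, not the paper's normalization procedure.

```python
import numpy as np

# Uniform reward shift: the optimal policy of the shifted MDP is unchanged.
gamma, c = 0.9, 5.0
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])   # P[a, s, s']
R = np.array([[1.0, 0.0], [0.5, 2.0]])     # R[s, a]

def greedy_policy(R):
    V = np.zeros(2)
    for _ in range(1000):                  # plain value iteration
        V = (R + gamma * np.einsum('ast,t->sa', P, V)).max(axis=1)
    return (R + gamma * np.einsum('ast,t->sa', P, V)).argmax(axis=1)

print(greedy_policy(R), greedy_policy(R + c))   # identical optimal policies
```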
One-Shot Averaging for Distributed TD($\lambda$) Under Markov Sampling
Tian, Haoxing, Paschalidis, Ioannis Ch., Olshevsky, Alex
Actor-critic methods achieve state-of-the-art performance in many domains, including robotics, game playing, and control systems (LeCun et al. (2015); Mnih et al. (2016); Silver et al. (2017)). Temporal Difference (TD) Learning may be thought of as a component of actor-critic, and better bounds for TD Learning are usually ingredients of actor-critic analyses. We consider the problem of policy evaluation in reinforcement learning: given a Markov Decision Process (MDP) and a policy, we need to estimate the value of each state (the expected discounted sum of all future rewards) under this policy. Policy evaluation is important because it is effectively a subroutine of many other algorithms, such as policy iteration and actor-critic. The main challenges for policy evaluation are that we usually do not know the underlying MDP directly and can only interact with it, and that the number of states is typically too large, forcing us to maintain a low-dimensional approximation of the true vector of state values.
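As a point of reference for the algorithm being analyzed, here is a minimal sketch of distributed TD($\lambda$) with one-shot averaging: each agent runs TD($\lambda$) with linear function approximation along its own Markov trajectory from the same MDP and policy, and the parameter vectors are averaged a single time at the end. The random MDP, features, step size, and trajectory length are illustrative placeholders.

```python
import numpy as np

# Toy distributed TD(lambda) with linear function approximation and
# one-shot averaging of the agents' parameters at the end.
rng = np.random.default_rng(2)
n_states, d, gamma, lam, alpha = 10, 4, 0.9, 0.7, 0.05
P = rng.random((n_states, n_states)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_states)
Phi = rng.standard_normal((n_states, d))        # feature matrix, rows phi(s)

def run_agent(T=5000):
    theta = np.zeros(d)
    z = np.zeros(d)
    s = rng.integers(n_states)
    for _ in range(T):
        s_next = rng.choice(n_states, p=P[s])
        delta = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        z = gamma * lam * z + Phi[s]            # eligibility trace
        theta = theta + alpha * delta * z       # TD(lambda) update
        s = s_next
    return theta

n_agents = 5
theta_avg = np.mean([run_agent() for _ in range(n_agents)], axis=0)  # one-shot averaging
print(theta_avg)
```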
Sample Complexity of the Linear Quadratic Regulator: A Reinforcement Learning Lens
Moghaddam, Amirreza Neshaei, Olshevsky, Alex, Gharesifard, Bahman
We provide the first known algorithm that provably achieves $\varepsilon$-optimality within $\widetilde{\mathcal{O}}(1/\varepsilon)$ function evaluations for the discounted discrete-time LQR problem with unknown parameters, without relying on two-point gradient estimates. These estimates are known to be unrealistic in many settings, as they depend on using the exact same initialization, which is to be selected randomly, for two different policies. Our results substantially improve upon the existing literature outside the realm of two-point gradient estimates, which either leads to $\widetilde{\mathcal{O}}(1/\varepsilon^2)$ rates or heavily relies on stability assumptions.
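For context on what avoiding two-point gradient estimates means in practice, the sketch below uses a one-point (single-rollout) zeroth-order estimate of the discounted LQR cost gradient with respect to a static feedback gain. The system, smoothing radius, horizon, and step size are illustrative placeholders, and the loop is not the paper's algorithm or rate; the step size is kept very small because one-point estimates are noisy.

```python
import numpy as np

# One-point zeroth-order gradient estimate for a toy discounted LQR problem.
rng = np.random.default_rng(3)
n, m, gamma = 3, 2, 0.9
A = 0.5 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
Q, R = np.eye(n), np.eye(m)

def cost(K, horizon=200):
    """Single rollout of the discounted cost from a random initial state."""
    x = rng.standard_normal(n)
    total = 0.0
    for t in range(horizon):
        u = -K @ x
        total += gamma ** t * (x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u
    return total

d = m * n                      # number of entries of K
radius, step = 0.1, 1e-5
K = np.zeros((m, n))
for it in range(200):
    U = rng.standard_normal((m, n))
    U /= np.linalg.norm(U)                            # unit Frobenius-norm direction
    g_hat = (d / radius) * cost(K + radius * U) * U   # one-point gradient estimate
    K -= step * g_hat
print(cost(K))
```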
Convex SGD: Generalization Without Early Stopping
Hendrickx, Julien, Olshevsky, Alex
We consider the generalization error associated with stochastic gradient descent on a smooth convex function over a compact set. We show the first bound on the generalization error that vanishes when the number of iterations $T$ and the dataset size $n$ go to infinity at arbitrary rates; our bound scales as $\tilde{O}(1/\sqrt{T} + 1/\sqrt{n})$ with step-size $\alpha_t = 1/\sqrt{t}$. In particular, strong convexity is not needed for stochastic gradient descent to generalize well.
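A minimal sketch of the setting, assuming a least-squares objective as the smooth convex function and a Euclidean ball as the compact set: projected stochastic gradient descent with the step size $\alpha_t = 1/\sqrt{t}$ from the abstract. The data, radius, and iteration count are toy choices for illustration.

```python
import numpy as np

# Projected SGD on a least-squares objective over a Euclidean ball,
# with step size alpha_t = 1 / sqrt(t).
rng = np.random.default_rng(4)
n, d, radius = 200, 5, 10.0
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def project(w):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

w = np.zeros(d)
T = 5000
for t in range(1, T + 1):
    i = rng.integers(n)                         # sample one data point
    grad = (X[i] @ w - y[i]) * X[i]             # gradient of 0.5*(x_i'w - y_i)^2
    w = project(w - grad / np.sqrt(t))          # step size alpha_t = 1/sqrt(t)
print(0.5 * np.mean((X @ w - y) ** 2))          # final training loss
```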
On the Performance of Temporal Difference Learning With Neural Networks
Tian, Haoxing, Paschalidis, Ioannis Ch., Olshevsky, Alex
Neural Temporal Difference (TD) Learning is an approximate temporal difference method for policy evaluation that uses a neural network for function approximation. Analysis of Neural TD Learning has proven to be challenging. In this paper we provide a convergence analysis of Neural TD Learning with a projection onto $B(\theta_0, \omega)$, a ball of fixed radius $\omega$ around the initial point $\theta_0$. We show an approximation bound of $O(\epsilon) + \tilde{O} (1/\sqrt{m})$ where $\epsilon$ is the approximation quality of the best neural network in $B(\theta_0, \omega)$ and $m$ is the width of all hidden layers in the network.
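The following is a minimal sketch, under my own toy choices of MDP, width, radius, and step size, of the projected Neural TD(0) scheme described above: a two-layer ReLU network approximates state values, semi-gradient TD updates are applied, and the parameters are projected back onto the ball $B(\theta_0, \omega)$ around the initialization after every step.

```python
import numpy as np

# Toy Neural TD(0) with projection of the parameters onto B(theta_0, omega).
rng = np.random.default_rng(5)
n_states, d, m, gamma, omega, alpha = 8, 4, 64, 0.9, 1.0, 0.01
P = rng.random((n_states, n_states)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_states)
Phi = rng.standard_normal((n_states, d))        # state features phi(s)

W0 = rng.standard_normal((m, d))                # initialization theta_0 = (W0, a0)
a0 = rng.choice([-1.0, 1.0], size=m)
W, a = W0.copy(), a0.copy()

def value_and_grads(s):
    h = W @ Phi[s]
    act = np.maximum(h, 0.0)                    # ReLU
    v = a @ act / np.sqrt(m)
    grad_a = act / np.sqrt(m)
    grad_W = np.outer(a * (h > 0), Phi[s]) / np.sqrt(m)
    return v, grad_a, grad_W

def project(W, a):
    """Project (W, a) onto the ball of radius omega around (W0, a0)."""
    diff = np.concatenate([(W - W0).ravel(), a - a0])
    norm = np.linalg.norm(diff)
    if norm <= omega:
        return W, a
    scale = omega / norm
    return W0 + scale * (W - W0), a0 + scale * (a - a0)

s = rng.integers(n_states)
for _ in range(5000):
    s_next = rng.choice(n_states, p=P[s])
    v, grad_a, grad_W = value_and_grads(s)
    v_next, _, _ = value_and_grads(s_next)
    delta = r[s] + gamma * v_next - v           # TD error
    a = a + alpha * delta * grad_a              # semi-gradient TD update
    W = W + alpha * delta * grad_W
    W, a = project(W, a)
    s = s_next
```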
Closing the gap between SVRG and TD-SVRG with Gradient Splitting
Mustafin, Arsenii, Olshevsky, Alex, Paschalidis, Ioannis Ch.
Temporal difference (TD) learning is a policy evaluation method in reinforcement learning whose performance can be enhanced by variance reduction techniques. Recently, multiple works have sought to fuse TD learning with SVRG to obtain a policy evaluation method with a geometric rate of convergence. However, the resulting convergence rate is significantly weaker than what is achieved by SVRG in the setting of convex optimization. In this work, we utilize a recent interpretation of TD learning as the splitting of the gradient of an appropriately chosen function, which simplifies the algorithm and its fusion with SVRG. Our main result is a geometric convergence bound with a predetermined learning rate of $1/8$, identical to the convergence bound available for SVRG in the convex setting. Our theoretical findings are supported by a set of experiments.
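A minimal sketch of an SVRG-style variance-reduced TD update with linear function approximation over a fixed batch of transitions, using the learning rate $1/8$ quoted in the abstract. The synthetic dataset, feature normalization, and epoch length are toy choices of mine; the conditions under which $1/8$ is valid are those established in the paper, not by this example.

```python
import numpy as np

# SVRG-style variance-reduced TD updates on a fixed batch of transitions.
rng = np.random.default_rng(6)
N, d, gamma, lr = 500, 5, 0.9, 1.0 / 8.0
phi = rng.standard_normal((N, d)); phi /= np.linalg.norm(phi, axis=1, keepdims=True)
phi_next = rng.standard_normal((N, d)); phi_next /= np.linalg.norm(phi_next, axis=1, keepdims=True)
rewards = rng.random(N)

def td_direction(theta, idx):
    """TD update direction for transitions idx (the gradient-splitting view)."""
    delta = rewards[idx] + gamma * phi_next[idx] @ theta - phi[idx] @ theta
    return delta[:, None] * phi[idx]            # one row per sampled transition

theta = np.zeros(d)
for epoch in range(20):
    theta_ref = theta.copy()
    full_dir = td_direction(theta_ref, np.arange(N)).mean(axis=0)   # full pass
    for _ in range(N):                           # inner loop
        i = rng.integers(N)
        g = td_direction(theta, [i])[0]
        g_ref = td_direction(theta_ref, [i])[0]
        theta = theta + lr * (g - g_ref + full_dir)    # variance-reduced step
print(theta)
```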