
Collaborating Authors

Tsitsiklis



Reinforcement Learning in POMDP's via Direct Gradient Ascent

Baxter, Jonathan, Bartlett, Peter L.

arXiv.org Artificial Intelligence

This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter $\beta \in [0,1)$, which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward.
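As an illustrative aside (not taken from the paper), the single-sample-path estimator described above can be sketched with a discounted eligibility trace: accumulate the score function $\nabla \log \pi$ with decay $\beta$, and average the product of the trace with the observed rewards along one trajectory. The two-armed environment and softmax policy below are toy assumptions for illustration only.

```python
import numpy as np

def gpomdp_estimate(theta, env_step, T=10_000, beta=0.9, seed=0):
    """GPOMDP-style gradient estimate from a single sample path (sketch).

    theta:    parameters of a softmax policy over the available actions
    env_step: function(action, rng) -> reward (toy stationary environment)
    beta:     the bias-variance trade-off parameter in [0, 1)
    """
    rng = np.random.default_rng(seed)
    z = np.zeros_like(theta)       # eligibility trace
    delta = np.zeros_like(theta)   # running gradient estimate
    for t in range(T):
        probs = np.exp(theta) / np.exp(theta).sum()   # softmax policy
        a = rng.choice(len(theta), p=probs)
        grad_log = -probs
        grad_log[a] += 1.0                            # grad of log pi(a; theta)
        z = beta * z + grad_log                       # beta-discounted trace
        r = env_step(a, rng)
        delta += (r * z - delta) / (t + 1)            # running average of r * z
    return delta

# Toy two-armed environment: action 1 pays reward 1, action 0 pays 0.
est = gpomdp_estimate(np.zeros(2), lambda a, rng: float(a))
```

For this toy problem the estimate should point toward increasing the probability of action 1; larger $\beta$ trades more variance for less bias in general.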



Finite-Sample Analysis of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning

Chen, Suei-Wen, Ross, Keith, Youssef, Pierre

arXiv.org Artificial Intelligence

Monte Carlo Exploring Starts (MCES), which aims to learn the optimal policy using only sample returns, is a simple and natural reinforcement learning algorithm that has been shown to converge under various conditions. However, the convergence rate analysis for MCES-style algorithms in the form of sample complexity has received very little attention. In this paper we develop a finite sample bound for a modified MCES algorithm which solves the stochastic shortest path problem. To this end, we prove a novel result on the convergence rate of the policy iteration algorithm. This result implies that with probability at least $1-\delta$, the algorithm returns an optimal policy after $\tilde{O}(SAK^3\log^3\frac{1}{\delta})$ sampled episodes, where $S$ and $A$ denote the number of states and actions respectively, $K$ is a proxy for episode length, and $\tilde{O}$ hides logarithmic factors and constants depending on the rewards of the environment that are assumed to be known.
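As a rough sketch of the MCES idea (not the paper's modified algorithm), each episode starts from a uniformly random state-action pair, follows the current greedy policy, updates action values from the observed returns, and improves the policy greedily. The tiny deterministic shortest-path chain below is an assumption for illustration.

```python
import random

# Toy stochastic shortest path: states 0..2, terminal state 3;
# actions 0 = left, 1 = right; every step costs 1 (reward -1).
def step(s, a):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, -1.0

def mces(episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(3) for a in (0, 1)}
    N = {k: 0 for k in Q}
    pi = {s: 0 for s in range(3)}                  # arbitrary initial policy
    for _ in range(episodes):
        s, a = rng.randrange(3), rng.randrange(2)  # exploring start
        traj = []
        for _ in range(50):                        # cap episode length
            s2, r = step(s, a)
            traj.append((s, a, r))
            if s2 == 3:
                break
            s, a = s2, pi[s2]
        G = 0.0
        for (s, a, r) in reversed(traj):           # returns from sample episodes
            G += r
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
        for st in range(3):                        # greedy policy improvement
            pi[st] = max((0, 1), key=lambda act: Q[(st, act)])
    return pi

pi = mces()
```

On this chain the learned policy should move right from every state, reaching the terminal state along the shortest path.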


On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation

Winnicki, Anna, Srikant, R.

arXiv.org Artificial Intelligence

A common technique in reinforcement learning is to evaluate the value function from Monte Carlo simulations of a given policy, and use the estimated value function to obtain a new policy which is greedy with respect to the estimated value function. A well-known longstanding open problem in this context is to prove the convergence of such a scheme when the value function of a policy is estimated from data collected from a single sample path obtained from implementing the policy (see page 99 of [Sutton and Barto, 2018], page 8 of [Tsitsiklis, 2002]). We present a solution to the open problem by showing that a first-visit version of such a policy iteration scheme indeed converges to the optimal policy provided that the policy improvement step uses lookahead [Silver et al., 2016, Mnih et al., 2016, Silver et al., 2017b] rather than a simple greedy policy improvement. We provide results both for the original open problem in the tabular setting and also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.
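The first-visit evaluation step discussed above can be sketched as follows: for each episode, compute the return from the first time each state is visited and average those returns. This toy deterministic chain and fixed policy are assumptions for illustration; the paper's scheme additionally uses lookahead in the improvement step, which is not reproduced here.

```python
import random

def first_visit_eval(policy, step, n_states, episodes=500, seed=0):
    """First-visit Monte Carlo evaluation of a fixed policy (sketch)."""
    rng = random.Random(seed)
    V = [0.0] * n_states
    N = [0] * n_states
    for _ in range(episodes):
        s, traj = 0, []
        while s is not None:                 # None marks the terminal state
            s2, r = step(s, policy[s], rng)
            traj.append((s, r))
            s = s2
        G, returns = 0.0, []
        for (s, r) in reversed(traj):        # backward pass: returns G_t
            G += r
            returns.append((s, G))
        seen = set()
        for (s, G) in reversed(returns):     # forward order -> first visits
            if s not in seen:
                seen.add(s)
                N[s] += 1
                V[s] += (G - V[s]) / N[s]    # incremental average
    return V

# Toy chain: states 0, 1, 2 move right deterministically, terminate after 2.
def step(s, a, rng):
    s2 = s + 1
    return (None if s2 == 3 else s2), -1.0

V = first_visit_eval([0, 0, 0], step, n_states=3)
```

Here the estimated values equal the exact negative distances to the terminal state.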


On the Convergence of Discounted Policy Gradient Methods

Nota, Chris

arXiv.org Artificial Intelligence

Policy gradient methods are a class of reinforcement learning (RL) algorithms that attempt to directly maximize the expected performance of an agent's policy by following the gradient of an objective function (Sutton et al., 2000), typically the expected sum of rewards, using a stochastic estimator generated by interacting with the environment. Unbiased estimators of this gradient can suffer from high variance due to high variance in the sum of future rewards. A common approach is to instead consider an exponentially discounted sum of future rewards. This approach reduces the variance of most estimators but introduces bias (Thomas, 2014). Frequently, the discounted sum of future rewards is estimated by a critic (Konda and Tsitsiklis, 2000). It has been argued that when a critic is used, discounting has the additional benefit of reducing approximation error (Zhang et al., 2020). The "discounted" policy gradient was originally introduced as the gradient of a discounted objective (Sutton et al., 2000). However, it has been shown that the gradient of the discounted objective does not produce the update direction followed by most discounted policy gradient algorithms (Thomas, 2014; Nota and Thomas, 2019).
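To make the distinction above concrete, here is a minimal sketch of the update direction most "discounted" policy gradient algorithms follow: each score-function term is weighted by the discounted sum of future rewards, but not by the additional $\gamma^t$ factor that the gradient of the discounted objective would require (Thomas, 2014). The function name and toy inputs are assumptions for illustration.

```python
import numpy as np

def discounted_pg_update(grad_logs, rewards, gamma=0.99):
    """Update direction used by typical discounted PG algorithms (sketch).

    grad_logs: list of grad log pi(a_t | s_t) arrays from one episode
    rewards:   list of rewards r_t
    Note: the true gradient of the discounted objective would weight
    each term by an extra gamma**t factor, which is omitted here.
    """
    T = len(rewards)
    returns = [0.0] * T
    G = 0.0
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G           # discounted sum of future rewards
        returns[t] = G
    g = np.zeros_like(grad_logs[0])
    for t in range(T):
        g += grad_logs[t] * returns[t]       # no gamma**t state weighting
    return g

# Two-step episode with unit score terms and rewards 1, 1 at gamma = 0.5:
# returns are [1.5, 1.0], so the update direction sums to 2.5.
g = discounted_pg_update([np.array([1.0]), np.array([1.0])], [1.0, 1.0], gamma=0.5)
```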


A Concentration Bound for Distributed Stochastic Approximation

Dolhare, Harsh, Borkar, Vivek

arXiv.org Artificial Intelligence

We revisit the classical model of Tsitsiklis, Bertsekas and Athans for distributed stochastic approximation with consensus. The main result is an analysis of this scheme using the ODE approach to stochastic approximation, leading to a high probability bound for the tracking error between suitably interpolated iterates and the limiting differential equation. Several future directions will also be highlighted.
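The scheme revisited above can be sketched as a consensus-plus-drift iteration: each agent averages its iterate with its neighbors' via a doubly stochastic weight matrix, then takes a diminishing stochastic-approximation step. The ring topology, toy drift $h(x) = -x$ (whose limiting ODE $\dot{x} = -x$ has equilibrium 0), and noise scale below are all assumptions for illustration.

```python
import numpy as np

def distributed_sa(n_agents=3, steps=5000, seed=0):
    """Distributed stochastic approximation with consensus (sketch)."""
    rng = np.random.default_rng(seed)
    # Doubly stochastic consensus weights on a ring.
    W = np.zeros((n_agents, n_agents))
    for i in range(n_agents):
        W[i, i] = 0.5
        W[i, (i + 1) % n_agents] = 0.25
        W[i, (i - 1) % n_agents] = 0.25
    x = rng.normal(size=n_agents) * 5.0      # scattered initial iterates
    for n in range(1, steps + 1):
        a_n = 1.0 / n                        # diminishing step size
        noise = rng.normal(scale=0.1, size=n_agents)
        x = W @ x + a_n * (-x + noise)       # consensus step + noisy drift
    return x

x = distributed_sa()
```

After many steps the agents' iterates track the ODE and cluster near its equilibrium, consistent with the high-probability tracking bound described in the abstract.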


Average-Reward Reinforcement Learning with Trust Region Methods

Ma, Xiaoteng, Tang, Xiaohang, Xia, Li, Yang, Jun, Zhao, Qianchuan

arXiv.org Artificial Intelligence

Most reinforcement learning algorithms optimize the discounted criterion, which helps accelerate convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks such as finance-related problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem with the long-run average criterion. Firstly, we develop a unified trust region theory with discounted and average criteria. With the average criterion, a novel performance bound within the trust region is derived with the Perturbation Analysis (PA) theory. Secondly, we propose a practical algorithm named Average Policy Optimization (APO), which improves the value estimation with a novel technique named Average Value Constraint. To the best of our knowledge, our work is the first to study the trust region approach with the average criterion, and it complements the framework of reinforcement learning beyond the discounted criterion. Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach.


Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER

Holzleitner, Markus, Gruber, Lukas, Arjona-Medina, José, Brandstetter, Johannes, Hochreiter, Sepp

arXiv.org Artificial Intelligence

We prove, under commonly used assumptions, the convergence of actor-critic reinforcement learning algorithms, which simultaneously learn a policy function, the actor, and a value function, the critic. Both functions can be deep neural networks of arbitrary complexity. Our framework allows showing convergence of the well-known Proximal Policy Optimization (PPO) and of the recently introduced RUDDER. For the convergence proof we employ recently introduced techniques from two time-scale stochastic approximation theory. Our results are valid for actor-critic methods that use episodic samples and that have a policy that becomes more greedy during learning. Previous convergence proofs assume linear function approximation, cannot treat episodic samples, or do not consider that policies become greedy. The latter is relevant since optimal policies are typically deterministic.


Average-reward model-free reinforcement learning: a systematic review and literature mapping

Dewanto, Vektor, Dunn, George, Eshragh, Ali, Gallagher, Marcus, Roosta, Fred

arXiv.org Artificial Intelligence

Model-free reinforcement learning (RL) has been an active area of research and provides a fundamental framework for agent-based learning and decision-making in artificial intelligence. In this paper, we review a specific subset of this literature, namely work that utilizes optimization criteria based on average rewards in the infinite horizon setting. Average reward RL has the advantage of being the most selective criterion in recurrent (ergodic) Markov decision processes. In comparison to the widely used discounted reward criterion, it also requires no discount factor, which is a critical hyperparameter, and properly aligns the optimization and performance metrics. Motivated by the sole prior survey by Mahadevan (1996a), we provide an updated review of work in this area and extend it to cover policy-iteration and function approximation methods (in addition to the value-iteration and tabular counterparts). We also identify and discuss opportunities for future work.