policy iteration algorithm
2dace78f80bc92e6d7493423d729448e-Reviews.html
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. It presents a slight modification of the NAC algorithm, where the original algorithm is a special case which is called forgetful NAC. The authors show that forget full Nac and optimistic policy iteration are equivalent. The authors also present a non-optimality result for soft-greedy Gibbs distribution, I.e., the optimal solution is not a fixed point of the policy iteration algorithm. I liked the unified view on both type of algorithms.
Geometric Re-Analysis of Classical MDP Solving Algorithms
Mustafin, Arsenii, Pakharev, Aleksei, Olshevsky, Alex, Paschalidis, Ioannis Ch.
We extend a recently introduced geometric interpretation of Markov Decision Processes (MDPs) that provides a new perspective on MDP algorithms and their dynamics. Based on this view, we develop a novel analytical framework that simplifies the proofs of existing results and enables us to derive new ones. Specifically, we analyze the behavior of two classical MDP-solving algorithms: Policy Iteration (PI) and Value Iteration (VI). For each algorithm, we first describe its dynamics in geometric terms and then present an analysis along with several convergence results. We begin by introducing an MDP transformation that modifies the discount factor ฮณ and demonstrate how this transformation improves the convergence properties of both algorithms, provided that it can be applied such that the resulting system remains a regular MDP. Second, we present a new analysis of PI in a 2-state MDP case, showing that the number of iterations required for convergence is bounded by the number of state-action pairs. Finally, we reveal an additional convergence factor in the VI algorithm for cases with a connected optimal policy, which is attributed to an extra rotation component in the VI dynamics.
fd5c905bcd8c3348ad1b35d7231ee2b1-Reviews.html
This paper is published in the context of making Learning from Demonstration more robust when a limited number of demonstrations are available. Many of the low level trajectory learning LfD approaches suffer from fragile policies. This paper proposes to use Reinforcement learning to overcome this limitation. This paper falls squarely in the LfD field and does not tackle Inverse reinforcement learning, i.e. the reward function is assumed to be known to the agent rather than inferred by demonstration. One work with a very similar flavor is that of Smart, W. and Kaelbling, L.P. "Effective Reinfrocement Learning for Mobile Robots" ICRA 2002.
Quantum Computing Methods for Supply Chain Management
Jiang, Hansheng, Shen, Zuo-Jun Max, Liu, Junyu
Quantum computing is expected to have transformative influences on many domains, but its practical deployments on industry problems are underexplored. We focus on applying quantum computing to operations management problems in industry, and in particular, supply chain management. Many problems in supply chain management involve large state and action spaces and pose computational challenges on classic computers. We develop a quantized policy iteration algorithm to solve an inventory control problem and demonstrative its effectiveness. We also discuss in-depth the hardware requirements and potential challenges on implementing this quantum algorithm in the near term. Our simulations and experiments are powered by \texttt{IBM Qiskit} and the \texttt{qBraid} system.
Model-Free Robust Reinforcement Learning with Linear Function Approximation
Panaganti, Kishan, Kalathil, Dileep
This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Process (RMDP) with large state spaces. The goal of the RMDPs framework is to find a policy that is robust against the parameter uncertainties due to the mismatch between the simulator model and real-world settings. We first propose Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation. We prove the convergence of this algorithm using stochastic approximation techniques. We then propose Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy. We also give a general weighted Euclidean norm bound on the error (closeness to optimality) of the resulting policy. Finally, we demonstrate the performance of our RLSPI algorithm on some benchmark problems from OpenAI Gym.
On the convergence of optimistic policy iteration for stochastic shortest path problem
In this paper, we prove some convergence results of a special case of optimistic policy iteration algorithm for stochastic shortest path problem mentioned in [5] . We consider both Monte Carlo and TD(ฮป) methods for the policy evaluation step under the condition that termination state will eventually be reached almost surely.
Dissecting Reinforcement Learning-Part.1
Premise[This post is an introduction to reinforcement learning and it is meant to be the starting point for a reader who already has some machine learning background and is confident with a little bit of math and Python. When I study a new algorithm I always want to understand the underlying mechanisms. In this sense it is always useful to implement the algorithm from scratch using a programming language. I followed this approach in this post which can be long to read but worthy. When I started to study reinforcement learning I did not find any good online resource which explained from the basis what reinforcement learning really is. Most of the (very good) blogs out there focus on the modern approaches (Deep Reinforcement Learning) and introduce the Bellman equation without a satisfying explanation. I turned my attention to books and I found the one of Russel and Norvig called Artificial Intelligence: A Modern Approach. This post is based on chapters 17 of the second edition, and it can be considered an extended review of the chapter. I will use the same mathematical notation of the authors, in this way you can use the book to cover some missing parts or vice versa.
Solving POMDPs by Searching in Policy Space
Most algorithms for solving POMDPs iteratively improve a value function that implicitly represents a policy and are said to search in value function space. This paper presents an approach to solving POMDPs that represents a policy explicitly as a finite-state controller and iteratively improves the controller by search in policy space. Two related algorithms illustrate this approach. The first is a policy iteration algorithm that can outperform value iteration in solving infinitehorizon POMDPs. It provides the foundation for a new heuristic search algorithm that promises further speedup by focusing computational effort on regions of the problem space that are reachable, or likely to be reached, from a start state.
On the Complexity of Policy Iteration
Mansour, Yishay, Singh, Satinder
Decision-making problems in uncertain or stochastic domains are often formulated as Markov decision processes (MD Ps). Policy iteration (PI) is a popular algorithm for searching over policy-space, the size of which is exponential in the number of states. We are interested in bounds on the complexity of PI that do not depend on the value of the discount factor. In this paper we prove the first such nontrivial, worst-case, upper bounds on the number of iterations required by PI to converge to the optimal policy. Our analysis also sheds new light on the manner in which PI progresses through the space of policies.
Polynomial Value Iteration Algorithms for Detrerminstic MDPs
Value iteration is a commonly used and empirically competitive method in solving many Markov decision process problems. However, it is known that value iteration has only pseudo-polynomial complexity in general. We establish a somewhat surprising polynomial bound for value iteration on deterministic Markov decision (DMDP) problems. We show that the basic value iteration procedure converges to the highest average reward cycle on a DMDP problem in heta(n^2) iterations, or heta(mn^2) total time, where n denotes the number of states, and m the number of edges. We give two extensions of value iteration that solve the DMDP in heta(mn) time. We explore the analysis of policy iteration algorithms and report on an empirical study of value iteration showing that its convergence is much faster on random sparse graphs.