Goto

Collaborating Authors

 simplex-pi


Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Neural Information Processing Systems

Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal "-discounted optimal policy. We consider two variations of PI: Howard's PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal Ïadvantage. 1 We 2Ìshow that 1 Howard's 1 PI 22 terminates


Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Scherrer, Bruno

arXiv.org Artificial Intelligence

Given a Markov Decision Process (MDP) with $n$ states and a totalnumber $m$ of actions, we study the number of iterations needed byPolicy Iteration (PI) algorithms to converge to the optimal$\gamma$-discounted policy. We consider two variations of PI: Howard'sPI that changes the actions in all states with a positive advantage,and Simplex-PI that only changes the action in the state with maximaladvantage. We show that Howard's PI terminates after at most $O\left(\frac{m}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$iterations, improving by a factor $O(\log n)$ a result by Hansen etal., while Simplex-PI terminates after at most $O\left(\frac{nm}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$iterations, improving by a factor $O(\log n)$ a result by Ye. Undersome structural properties of the MDP, we then consider bounds thatare independent of the discount factor~$\gamma$: quantities ofinterest are bounds $\tau\_t$ and $\tau\_r$---uniform on all states andpolicies---respectively on the \emph{expected time spent in transientstates} and \emph{the inverse of the frequency of visits in recurrentstates} given that the process starts from the uniform distribution.Indeed, we show that Simplex-PI terminates after at most $\tilde O\left(n^3 m^2 \tau\_t \tau\_r \right)$ iterations. This extends arecent result for deterministic MDPs by Post & Ye, in which $\tau\_t\le 1$ and $\tau\_r \le n$, in particular it shows that Simplex-PI isstrongly polynomial for a much larger class of MDPs. We explain whysimilar results seem hard to derive for Howard's PI. Finally, underthe additional (restrictive) assumption that the state space ispartitioned in two sets, respectively states that are transient andrecurrent for all policies, we show that both Howard's PI andSimplex-PI terminate after at most $\tilde O(m(n^2\tau\_t+n\tau\_r))$iterations.


Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Scherrer, Bruno

Neural Information Processing Systems

Given a Markov Decision Process (MDP) with $n$ states and $m$ actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma$-discounted optimal policy. We consider two variations of PI: Howard's PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $ O \left( \frac{ n m}{1-\gamma} \log \left( \frac{1}{1-\gamma} \right)\right) $ iterations, improving by a factor $O(\log n)$ a result by Hansen et al. (2013), while Simplex-PI terminates after at most $ O \left( \frac{n^2 m}{1-\gamma} \log \left( \frac{1}{1-\gamma} \right)\right) $ iterations, improving by a factor $O(\log n)$ a result by Ye (2011). Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor~$\gamma$: given a measure of the maximal transient time $\tau_t$ and the maximal time $\tau_r$ to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most $ \tilde O \left( n^3 m^2 \tau_t \tau_r \right) $ iterations. This generalizes a recent result for deterministic MDPs by Post & Ye (2012), in which $\tau_t \le n$ and $\tau_r \le n$. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned in two sets, respectively states that are transient and recurrent for all policies, we show that Simplex-PI and Howard's PI terminate after at most $ \tilde O(nm (\tau_t+\tau_r))$ iterations.