Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Neural Information Processing Systems
Given a Markov Decision Process (MDP) with $n$ states and $m$ actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma$-discounted policy. We consider two variants of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $ O \left( \frac{ n m}{1-\gamma} \log \left( \frac{1}{1-\gamma} \right)\right) $ iterations, improving by a factor $O(\log n)$ a result by Hansen et al. (2013), while Simplex-PI terminates after at most $ O \left( \frac{n^2 m}{1-\gamma} \log \left( \frac{1}{1-\gamma} \right)\right) $ iterations, improving by a factor $O(\log n)$ a result by Ye (2011). Under some structural assumptions on the MDP, we then consider bounds that are independent of the discount factor~$\gamma$: given a measure of the maximal transient time $\tau_t$ and the maximal time $\tau_r$ needed to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most $ \tilde O \left( n^3 m^2 \tau_t \tau_r \right) $ iterations.
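The following is a minimal Python sketch (not from the paper) contrasting the two PI variants named in the abstract on a tabular MDP. The array names `P` (transition probabilities, shape $(n, m, n)$), `r` (rewards, shape $(n, m)$), and the tolerance `1e-12` are illustrative assumptions; both variants differ only in which improvable states they switch at each iteration.

```python
import numpy as np

def policy_iteration(P, r, gamma, variant="howard"):
    """Exact tabular PI. P: (n, m, n) transitions, r: (n, m) rewards.
    variant is "howard" (switch all improvable states) or "simplex"
    (switch only the single state with maximal advantage)."""
    n, m, _ = P.shape
    pi = np.zeros(n, dtype=int)            # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[np.arange(n), pi]         # (n, n) transition matrix of pi
        r_pi = r[np.arange(n), pi]         # (n,)  reward vector of pi
        v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
        # Advantage of each state-action pair w.r.t. the current value v.
        q = r + gamma * P @ v              # (n, m) state-action values
        adv = q - v[:, None]
        if variant == "howard":
            # Howard's PI: change the action in every state with a
            # positive advantage.
            improvable = adv.max(axis=1) > 1e-12
            if not improvable.any():
                return pi, v               # pi is optimal
            pi = pi.copy()
            pi[improvable] = adv[improvable].argmax(axis=1)
        else:
            # Simplex-PI: change the action only in the state with the
            # maximal advantage.
            s, a = np.unravel_index(adv.argmax(), adv.shape)
            if adv[s, a] <= 1e-12:
                return pi, v               # pi is optimal
            pi = pi.copy()
            pi[s] = a
```

Under this sketch, the paper's bounds count the number of passes through the `while` loop: Simplex-PI performs cheaper iterations (one action switch each) but, per the abstract, may need a factor of roughly $n$ more of them than Howard's PI.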