Convergence Analysis of Policy Iteration

Heydari, Ali

arXiv.org Machine Learning 

This short study investigates the convergence of policy iteration (PI), one of the schemes for implementing adaptive/approximate dynamic programming (ADP), sometimes referred to as reinforcement learning (RL) or neuro-dynamic programming (NDP) [1]-[11]. Compared with its alternative, value iteration (VI), PI carries a higher computational load per iteration because it performs a 'full backup' rather than the 'partial backup' of VI [12]. However, PI has the advantage that the control under evolution remains stabilizing [10]; hence it is better suited to online implementation, i.e., adapting the control 'on the fly'. Convergence analyses for PI with continuous state and control spaces and an undiscounted cost function are given in [10]. The results presented in this study, however, take a different viewpoint, with different assumptions and lines of proof. Moreover, interested readers are referred to the results of simultaneous research (at least in terms of when the results became publicly available) presented in [13], which are the closest to the first two theorems of this study.
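To make the 'full backup' versus 'partial backup' distinction concrete, the following is a minimal sketch on a scalar linear-quadratic problem. It is not taken from the study. The system x_{k+1} = a*x_k + b*u_k, the undiscounted cost sum of q*x_k^2 + r*u_k^2, the linear policy u = -k*x, and all function names are assumptions chosen for illustration. In this setting, PI evaluates each policy exactly before improving it, whereas VI applies a single one-step update per iteration.

```python
def evaluate_policy(k, a, b, q, r):
    """Full backup: exact cost p of the policy u = -k*x, where V(x) = p*x**2.

    Solves p = q + r*k**2 + (a - b*k)**2 * p in closed form, which is
    finite only if the closed loop a - b*k is stable.
    """
    closed_loop = a - b * k
    if abs(closed_loop) >= 1.0:
        raise ValueError("policy is not stabilizing; its cost is infinite")
    return (q + r * k * k) / (1.0 - closed_loop * closed_loop)


def improve_policy(p, a, b, r):
    """Greedy gain k minimizing r*u**2 + p*(a*x + b*u)**2 over u = -k*x."""
    return a * b * p / (r + b * b * p)


def policy_iteration(k0, a, b, q, r, iterations=20):
    """PI: alternate exact evaluation and improvement from a stabilizing k0.

    Returns a list of (gain, cost) pairs, one per iteration.
    """
    k = k0
    history = []
    for _ in range(iterations):
        p = evaluate_policy(k, a, b, q, r)  # full backup
        history.append((k, p))
        k = improve_policy(p, a, b, r)
    return history


def value_iteration(p0, a, b, q, r, iterations=20):
    """VI: one greedy one-step update of p per iteration (partial backup)."""
    p = p0
    values = [p]
    for _ in range(iterations):
        k = improve_policy(p, a, b, r)
        p = q + r * k * k + (a - b * k) ** 2 * p
        values.append(p)
    return values


if __name__ == "__main__":
    # Hypothetical parameters: an unstable open loop (a > 1) and a
    # stabilizing initial gain k0 = 1, so that |a - b*k0| = 0.2 < 1.
    for k, p in policy_iteration(1.0, 1.2, 1.0, 1.0, 1.0, iterations=5):
        print(f"gain={k:.6f}  closed_loop={1.2 - k:.6f}  cost={p:.6f}")
```

With these hypothetical parameters, every gain produced by PI keeps |a - b*k| < 1, which mirrors the stabilizing property mentioned above. VI started from p = 0 first yields the gain k = 0, which leaves the unstable open loop a = 1.2 in place, so its intermediate policies need not be stabilizing in this example.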
