Convergence Analysis of Policy Iteration
This short study investigates the convergence of policy iteration (PI), one of the schemes for implementing adaptive/approximate dynamic programming (ADP), sometimes referred to as reinforcement learning (RL) or neuro-dynamic programming (NDP) [1]-[11]. Compared to its alternative, value iteration (VI), PI calls for a higher computational load per iteration, because it performs a 'full backup' as opposed to the 'partial backup' in VI [12]. However, PI has the advantage that the evolving control remains stabilizing [10]; hence it is more suitable for online implementation, i.e., adapting the control 'on the fly'. Convergence analyses for PI with continuous state and control spaces and an undiscounted cost function are given in [10]. The results presented in this study, however, take a different viewpoint, with different assumptions and lines of proof. Moreover, interested readers are referred to the results of simultaneous research (at least in terms of the public availability of the results) presented in [13], which are the closest to the first two theorems of this study.
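To make the 'full backup' idea concrete, the following is a minimal sketch of PI on a toy finite-state, deterministic, undiscounted problem. It is only an illustrative analog and not the continuous-space setting analyzed in the study; the system, the costs, and all names (`dynamics`, `evaluate`, `improve`, `policy_iteration`) are hypothetical. The initial policy is assumed to be stabilizing in the sense that it drives every state to the zero-cost goal state.

```python
# Toy deterministic, undiscounted problem (hypothetical, for illustration only).
STEP_COST, JUMP_COST = 1.0, 1.5
STATES = range(5)            # state 0 is the absorbing, zero-cost goal
ACTIONS = ("step", "jump")


def dynamics(s, a):
    """Return (next_state, stage_cost) for the toy system."""
    if s == 0:
        return 0, 0.0
    if a == "step":
        return s - 1, STEP_COST
    return max(s - 2, 0), JUMP_COST


def evaluate(policy, tol=1e-12, max_sweeps=1000):
    """'Full backup': solve V(s) = cost + V(next) for a fixed policy by sweeping."""
    V = [0.0 for _ in STATES]
    for _ in range(max_sweeps):
        delta = 0.0
        for s in STATES:
            nxt, cost = dynamics(s, policy[s])
            new_v = cost + V[nxt]
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    return V


def improve(policy, V):
    """Greedy policy update; an action changes only if it is strictly better."""
    new_policy = list(policy)
    for s in STATES:
        nxt, cost = dynamics(s, policy[s])
        best = cost + V[nxt]
        for a in ACTIONS:
            nxt, cost = dynamics(s, a)
            q = cost + V[nxt]
            if q < best - 1e-12:
                best, new_policy[s] = q, a
    return new_policy


def policy_iteration(policy):
    """Alternate evaluation and improvement until the policy stops changing."""
    history = []
    while True:
        V = evaluate(policy)
        history.append(V)
        new_policy = improve(policy, V)
        if new_policy == policy:
            return policy, V, history
        policy = new_policy


policy, V, history = policy_iteration(["step"] * 5)
```

In this sketch each PI iteration runs the evaluation to convergence before the policy is updated, which is the per-iteration cost the abstract contrasts with VI; a VI-style scheme would instead apply a single minimizing sweep per iteration. The `history` list records the value function of each successive policy, so the non-increasing cost sequence can be inspected directly.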
May-19-2015