A  Q-value convergence

We here show that if a tabular agent converges to a policy π in a continuous NDP, then its Q-values converge to the Q-values of π.

Neural Information Processing Systems 

See Singh et al. (2000). Moreover, SARSA and Expected SARSA are also both appropriate, if the agent is greedy in the limit.

Theorem 3. Every continuous NDP has a strongly ratifiable policy.

Proof. Let A satisfy the following in a given NDP:

1. A is greedy in the limit, i.e. for all δ > 0, P(Q_t(s_t, a_t) > max_a Q_t(s_t, a) − δ) → 1 as t → ∞.
2. A's Q-values are accurate in the limit, i.e. if π_t → π, then Q_t → Q^π.

(Note that condition 2 requires that the agent takes every action in every state infinitely many times.)

Then φ has a fixed point.
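The two conditions above are closely related to the GLIE conditions of Singh et al. (2000): the exploration rate must decay so the agent becomes greedy in the limit, while still visiting every state-action pair infinitely often. A minimal sketch of such an agent follows. Everything here is an illustrative assumption, not the paper's construction: the single-state, two-action environment, the 1/t epsilon schedule, the 1/N(s, a) step sizes, and the function name `glie_expected_sarsa` are all invented for illustration, and the setting is an ordinary bandit rather than a continuous NDP.

```python
import random

def glie_expected_sarsa(num_steps=20000, seed=0):
    # Hypothetical toy environment: a single state with two actions and
    # deterministic rewards (action 0 gives 1.0, action 1 gives 0.0).
    rng = random.Random(seed)
    rewards = [1.0, 0.0]
    Q = [0.0, 0.0]
    counts = [0, 0]
    for t in range(1, num_steps + 1):
        # GLIE schedule: epsilon -> 0 (greedy in the limit), yet the
        # harmonic series diverges, so every action is still taken
        # infinitely often.
        eps = 1.0 / t
        if rng.random() < eps:
            a = rng.randrange(2)                     # explore
        else:
            a = max(range(2), key=lambda i: Q[i])    # exploit
        r = rewards[a]
        counts[a] += 1
        alpha = 1.0 / counts[a]  # step sizes satisfy the Robbins-Monro conditions
        # With a single state and no discounting of future value, the
        # Expected SARSA target reduces to the immediate reward.
        Q[a] += alpha * (r - Q[a])
    return Q
```

Under this schedule the Q-value estimates converge to the true action values (here, 1.0 and 0.0), illustrating the "accurate in the limit" condition in the simplest possible case.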