A Natural Policy Gradient
Neural Information Processing Systems
The greedy optimal actions toward which the natural policy gradient moves are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al.
Dec-31-2002
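The link between the natural gradient and greedy actions under a compatible value function can be illustrated with a minimal sketch. The simplifying assumptions are ours, not the paper's: there is a single state (a bandit), the policy is a tabular softmax whose last logit is pinned at 0 so the Fisher matrix is invertible, and the action values are made up. The code computes the vanilla gradient g, the Fisher matrix F of the compatible features psi(a) = grad_theta log pi(a), and the natural gradient w = F^-1 g. That same w is the pi-weighted least-squares fit of Q by psi(a) . w. The function names (`softmax`, `solve`, `natural_policy_gradient`) and the small Gaussian-elimination helper are hypothetical and use plain Python to avoid dependencies.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def natural_policy_gradient(theta, q):
    """Single-state softmax policy (assumed); theta holds the free logits, the last is pinned at 0.

    Returns (pi, g, w, f): action probabilities, vanilla gradient, natural gradient,
    and the compatible approximation f(a) = psi(a) . w of the action values.
    """
    pi = softmax(list(theta) + [0.0])
    n, k = len(pi), len(theta)
    # Score function psi(a)_i = d log pi(a) / d theta_i = [a == i] - pi_i.
    psi = [[(1.0 if a == i else 0.0) - pi[i] for i in range(k)] for a in range(n)]
    # Vanilla gradient g = E_a[psi(a) Q(a)].
    g = [sum(pi[a] * psi[a][i] * q[a] for a in range(n)) for i in range(k)]
    # Fisher information F = E_a[psi(a) psi(a)^T].
    fisher = [[sum(pi[a] * psi[a][i] * psi[a][j] for a in range(n)) for j in range(k)]
              for i in range(k)]
    # Natural gradient w = F^-1 g; it also solves the pi-weighted least-squares
    # fit of Q by the compatible features, min_w E_a[(psi(a) . w - Q(a))^2].
    w = solve(fisher, g)
    f = [sum(psi[a][i] * w[i] for i in range(k)) for a in range(n)]
    return pi, g, w, f


if __name__ == "__main__":
    q = [1.0, 0.9, 0.0]    # hypothetical action values
    theta = [-2.0, 3.0]    # hypothetical logits: action 1 currently dominates
    pi, g, w, f = natural_policy_gradient(theta, q)
    print("policy:", [round(p, 4) for p in pi])
    print("vanilla gradient:", [round(x, 4) for x in g])
    print("natural gradient:", [round(x, 4) for x in w])
    print("greedy action under f:", max(range(len(f)), key=lambda a: f[a]))
```

With these hypothetical numbers, the vanilla gradient is largest on action 1. That action is already likely and only slightly better than average. The natural gradient, by contrast, comes out as Q shifted by a constant (w = [1.0, 0.9] relative to the pinned action). The greedy action of the compatible approximation is therefore action 0, the one with the highest value.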