Variance Adjusted Actor Critic Algorithms
In Reinforcement Learning (RL; [1]) and planning in Markov Decision Processes (MDPs; [2]), the typical objective is to maximize the cumulative (possibly discounted) expected reward, denoted by J. When the model's parameters are known, several well-established and efficient optimization algorithms are known. When the model parameters are not known, learning is needed and there are several algorithmic frameworks that solve the learning problem effectively, at least when the model is finite. Among these, actor-critic methods [3] are known to be particularly efficient. In typical actor-critic algorithms, the critic maintains an estimate of the value function - the expected reward-to-go. This function is then used by the actor to estimate the gradient of the objective with respect to some policy parameters, and then improve the policy by modifying the parameters in the direction of the gradient. The theory that underlies actor-critic algorithms is the policy gradient theorem [4], which relates the value function with the policy gradient.
Oct-14-2013