Reinforcement Learning -- Policy Approximation
So far, every algorithm introduced has been a gradient method built on a value function or a Q function. We assumed that a true value V (or Q) exists for each state S (or state-action pair [S, A]), approached it with a gradient update whose formula involves V or Q, and, at the end of learning, derived a policy π(A|S) by choosing the most rewarding action in each state according to the estimated V or Q.

Policy gradient methods take a completely different view of reinforcement learning problems: instead of learning a value function, one can learn and update a policy directly. Recall that in previous posts the policy used during learning was always ϵ-greedy: the agent takes a random action with a certain probability and the greedy action otherwise. In policy gradient methods the problem is instead formulated as P(A|S, θ) = π(A|S, θ). That is, for each state the policy gives the probability of each action that can be taken from that state, and so that it can be optimised, the policy is parameterised by θ (similar to the weight parameter w of the value functions introduced before).

Because the objective J is a function of the policy π, the update of θ involves the current policy. After a series of derivations (for details, see Sutton's book, chapter 13), we get the update: θ ← θ + α G ∇π(A|S, θ) / π(A|S, θ), where α is the step size. G is still the cumulative discounted reward, and θ is updated with the derivative of the current policy.
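To make the idea concrete, here is a minimal sketch of a parameterised policy being updated directly from returns. It assumes the REINFORCE update from Sutton's chapter 13, θ ← θ + α G ∇ln π(A|S, θ), which is the same quantity as ∇π/π. The softmax parameterisation, the three-state corridor environment, and every name and constant below (softmax_policy, step, run_episode, reinforce_update, train, ALPHA, GAMMA) are illustrative assumptions, not taken from the original post.

```python
import math
import random

N_STATES = 3      # hypothetical corridor: states 0, 1, 2; reaching state 3 ends the episode
ACTIONS = (0, 1)  # 0 = left, 1 = right
GAMMA = 0.9       # discount factor (assumed value)
ALPHA = 0.1       # step size (assumed value)


def softmax_policy(theta, state):
    """Return pi(a | state, theta) for each action, using softmax over preferences."""
    prefs = theta[state]
    top = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Move left or right; reward 1.0 only when the goal state is reached."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    done = nxt == N_STATES
    return nxt, (1.0 if done else 0.0), done


def run_episode(theta, rng, max_steps=50):
    """Sample one episode by following the current policy from state 0."""
    state, trajectory = 0, []
    for _ in range(max_steps):
        probs = softmax_policy(theta, state)
        action = rng.choices(ACTIONS, weights=probs)[0]
        nxt, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = nxt
        if done:
            break
    return trajectory


def reinforce_update(theta, trajectory):
    """Apply theta += ALPHA * G_t * grad ln pi(A_t | S_t, theta) for each time step."""
    # Compute the discounted return G_t for every step, working backwards.
    returns, g = [], 0.0
    for _, _, reward in reversed(trajectory):
        g = reward + GAMMA * g
        returns.append(g)
    returns.reverse()
    # Sutton's episodic pseudocode also scales each update by GAMMA**t;
    # that factor is left out of this sketch.
    for (state, action, _), g_t in zip(trajectory, returns):
        probs = softmax_policy(theta, state)
        for b in ACTIONS:
            # For softmax preferences: d ln pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s)
            grad_log = (1.0 if b == action else 0.0) - probs[b]
            theta[state][b] += ALPHA * g_t * grad_log


def train(episodes=5000, seed=0):
    """Start from a uniform policy (all preferences zero) and run REINFORCE."""
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        reinforce_update(theta, run_episode(theta, rng))
    return theta


if __name__ == "__main__":
    learned = train()
    for s in range(N_STATES):
        print(s, softmax_policy(learned, s))
```

Note that no value table is consulted when acting: the agent samples directly from π(A|S, θ), and the only thing learned is θ. There is also no baseline here, so the updates can be noisy on larger problems.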
Oct-1-2019, 06:38:00 GMT