Dissecting Reinforcement Learning-Part.3
The update rule is based on the tuple State-Reward-State. Remember that now we are in the control case. Here we use the Q-function (see second post) to estimate the best policy. The Q-function requires as input a state-action pair.
Jan-7-2018, 01:02:07 GMT