An introduction to Policy Gradients with Cartpole and Doom
In the last two articles, on Q-learning and Deep Q-learning, we worked with value-based reinforcement learning algorithms. To choose which action to take in a given state, we pick the action with the highest Q-value (the maximum expected future reward for taking that action in that state). As a consequence, in value-based learning, a policy exists only because of these action-value estimates. Today, we'll learn a policy-based reinforcement learning technique called Policy Gradients. We'll apply it to Cartpole and Doom: the first agent, on Cartpole, will learn to keep the bar in balance.
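As a minimal sketch of the value-based selection rule described above, the snippet below picks the action with the highest Q-value for a single state. The action names, the Q-values, and the helper `greedy_action` are hypothetical and chosen only for illustration; they are not taken from the article's code.

```python
# Hypothetical Q-value estimates for one state of a two-action task
# (for example, pushing a cart left or right). The numbers are made up.
q_values = {"left": 0.42, "right": 0.57}


def greedy_action(q_values):
    """Return the action whose estimated Q-value is highest."""
    return max(q_values, key=q_values.get)


# In value-based learning the policy is implicit: it is whatever
# action the Q-value estimates rank first.
action = greedy_action(q_values)
print(action)
```

The policy here has no parameters of its own. Changing the Q-value estimates is the only way to change the chosen action, which is the dependence the paragraph above points out.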
May-16-2018, 03:06:38 GMT