Explaining Off-Policy Actor-Critic From A Bias-Variance Perspective

Fan, Ting-Han, Ramadge, Peter J.

arXiv.org Artificial Intelligence 

A practical reinforcement learning (RL) algorithm is often set in an actor-critic framework (Lin, 1992; Precup et al., 2000), where the policy (actor) generates actions and the Q/value function (critic) evaluates the policy's performance. Under this setting, off-policy RL uses transitions sampled from a replay buffer to update the Q function and, in turn, obtain a new policy π. A finite-length trajectory under π is then added to the buffer, and the process repeats. Notice that sampling from the replay buffer is an offline operation, while the growth of the replay buffer is an online operation. This implies that off-policy actor-critic RL lies between offline RL (Yu et al., 2020; Levine et al., 2020) and on-policy RL (Schulman et al., 2015, 2017).
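To make the alternation between the online step (growing the buffer under the current policy) and the offline step (updating the critic and actor from buffered transitions) concrete, here is a minimal sketch of such a loop. It assumes a toy tabular MDP with a Q-table critic and a softmax actor; the variable names, hyperparameters, and the simplified updates (no importance-sampling corrections) are illustrative and not taken from the paper.

```python
import random
from collections import deque
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: random transition probabilities P[s, a] and rewards R[s, a].
n_states, n_actions, gamma = 5, 3, 0.95
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

buffer = deque(maxlen=10_000)            # replay buffer of (s, a, r, s') transitions
Q = np.zeros((n_states, n_actions))      # critic
theta = np.zeros((n_states, n_actions))  # actor logits (policy parameters)

def policy(s):
    # Softmax policy pi(a|s) computed from the actor logits.
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

for iteration in range(200):
    # Online step: roll out a finite-length trajectory under the current pi
    # and append its transitions to the replay buffer.
    s = rng.integers(n_states)
    for _ in range(20):
        a = rng.choice(n_actions, p=policy(s))
        s_next = rng.choice(n_states, p=P[s, a])
        buffer.append((s, a, R[s, a], s_next))
        s = s_next

    # Offline step: sample buffered transitions (possibly generated by older
    # policies) to update the critic, then the actor.
    for _ in range(50):
        s_b, a_b, r_b, s2_b = random.choice(buffer)
        # Critic: expected-SARSA-style TD target under the *current* policy pi.
        target = r_b + gamma * policy(s2_b) @ Q[s2_b]
        Q[s_b, a_b] += 0.1 * (target - Q[s_b, a_b])
        # Actor: policy-gradient-like step pushing pi toward high-Q actions
        # (off-policy corrections are omitted for brevity in this sketch).
        grad = -policy(s_b)
        grad[a_b] += 1.0                 # d log pi(a_b|s_b) / d theta[s_b]
        theta[s_b] += 0.05 * Q[s_b, a_b] * grad

print("Greedy actions per state:", Q.argmax(axis=1))
```

The interleaving above is the point of the sentence in the abstract: the buffer grows online under the latest policy, but the critic and actor updates consume data that the current policy did not generate, which is what places off-policy actor-critic methods between the offline and on-policy regimes.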