Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics
Steckelmacher, Denis, Plisnier, Hélène, Roijers, Diederik M., Nowé, Ann
–arXiv.org Artificial Intelligence
We argue that actorcritic PGQL (O'Donoghue et al., 2017) allows for an off-policy algorithms are currently limited by their V function, but requires it to be combined with on-policy need for an on-policy critic, which severely constraints advantage values. Notable examples of algorithms without how the critic is learned. We propose an on-policy critic are AlphaGo Zero (Silver et al., 2017), Bootstrapped Dual Policy Iteration (BDPI), that replaces the critic with a slow-moving target policy a novel model-free actor-critic reinforcementlearning learned with tree search, and the Actor-Mimic (Parisotto algorithm for continuous states and et al., 2016), that minimizes the cross-entropy between discrete actions, with off-policy critics. Offpolicy an actor and the Softmax policies of critics (see Section critics are compatible with experience replay, 4.2). The need of most actor-critic algorithms for an onpolicy ensuring high sample-efficiency, without critic makes them incompatible with state-of-the-art the need for off-policy corrections. The actor, value-based algorithms of the Q-Learning family (Arjona-by slowly imitating the average greedy policy Medina et al., 2018; Hessel et al., 2017), that are all highly of the critics, leads to high-quality and statespecific sample-efficient but off-policy. In a discrete-actions setting, exploration, which we show approximates where off-policy value-based methods can be used, Thompson sampling. Because the actor this raises two questions: and critics are fully decoupled, BDPI is remarkably stable and, contrary to other state-of-theart 1. Can we use off-policy value-based algorithms in an algorithms, unusually forgiving for poorlyconfigured actor-critic setting?
arXiv.org Artificial Intelligence
Mar-11-2019
- Country:
- Europe
- Belgium (0.14)
- Netherlands (0.14)
- Europe
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Leisure & Entertainment > Games > Go (0.34)
- Technology: