Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Steckelmacher, Denis, Plisnier, Hélène, Roijers, Diederik M., Nowé, Ann

arXiv.org Artificial Intelligence 

We argue that actor-critic algorithms are currently limited by their need for an on-policy critic, which severely constrains how the critic is learned. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free actor-critic reinforcement-learning algorithm for continuous states and discrete actions, with off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we show approximates Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable and, contrary to other state-of-the-art algorithms, unusually forgiving for poorly-configured hyper-parameters.

PGQL (O'Donoghue et al., 2017) allows for an off-policy V function, but requires it to be combined with on-policy advantage values. Notable examples of algorithms without an on-policy critic are AlphaGo Zero (Silver et al., 2017), which replaces the critic with a slow-moving target policy learned with tree search, and the Actor-Mimic (Parisotto et al., 2016), which minimizes the cross-entropy between an actor and the Softmax policies of critics (see Section 4.2). The need of most actor-critic algorithms for an on-policy critic makes them incompatible with state-of-the-art value-based algorithms of the Q-Learning family (Arjona-Medina et al., 2018; Hessel et al., 2017), which are all highly sample-efficient but off-policy. In a discrete-actions setting, where off-policy value-based methods can be used, this raises two questions:

1. Can we use off-policy value-based algorithms in an actor-critic setting?
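The actor update described in the abstract, slowly imitating the average greedy policy of several off-policy critics, can be illustrated in a few lines of Python. The snippet below is a minimal sketch for a single state under a tabular view; the names (greedy_policy, update_actor, critics_q) and the mixing rate of 0.05 are illustrative assumptions, not the paper's notation or exact algorithm.

import numpy as np

def greedy_policy(q_values):
    # One-hot greedy policy for one critic's Q-values in a single state
    greedy = np.zeros_like(q_values)
    greedy[np.argmax(q_values)] = 1.0
    return greedy

def update_actor(actor_probs, critics_q, mix_rate=0.05):
    # Slowly move the actor's action distribution towards the average
    # greedy policy of the critics, as described in the abstract.
    avg_greedy = np.mean([greedy_policy(q) for q in critics_q], axis=0)
    return (1.0 - mix_rate) * actor_probs + mix_rate * avg_greedy

# Example: 3 critics, 4 discrete actions, uniform initial actor
actor = np.full(4, 0.25)
critics = [np.random.randn(4) for _ in range(3)]
actor = update_actor(actor, critics)
print(actor, actor.sum())  # remains a valid probability distribution

Because this update only reads the critics' Q-values, the critics themselves can be trained with any off-policy value-based method on replayed experience, which is the decoupling between actor and critics that the abstract highlights.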
