GAN Q-learning – Arxiv Vanity
Until now, deep learning methods in RL have used multiple function approximators (typically a network with shared hidden layers) to fit a distribution over state values or state-action values. For instance, Bootstrapped DQN [bootstrappedDQN] used k heads on the state-action value function Q for every available action and used them to model a distribution. In [bayesianpol], a Bayesian framework was applied to the actor-critic architecture by fitting a Gaussian process (GP) in place of the critic, which allows the update rules to be derived in closed form. More recently, [bellemare2017distributional] introduced the distributional algorithm C51, which aims to solve the RL problem by learning a categorical probability vector over the returns Q. Unlike [GANRL], which uses a generative network to learn the underlying transition model of the environment, we use a generative network to approximate the distribution of the Bellman updates.
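As a rough illustration of what "a categorical probability vector over returns" means in a C51-style method, the sketch below puts a fixed grid of return values (atoms) on an interval, turns per-action logits into probabilities over those atoms, and recovers a scalar Q as the expectation. The function names, the 51 atoms, and the bounds of -10 and 10 are illustrative assumptions and are not taken from this text; the sketch covers only the representation, not the training update.

```python
import math

def make_support(v_min, v_max, n_atoms):
    """Evenly spaced return values (atoms) on [v_min, v_max]."""
    step = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * step for i in range(n_atoms)]

def softmax(logits):
    """Numerically stable softmax: logits -> categorical probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_q(probs, support):
    """Scalar Q-value as the mean of the categorical return distribution."""
    return sum(p * z for p, z in zip(probs, support))

def greedy_action(probs_per_action, support):
    """Pick the action whose return distribution has the largest mean."""
    q_values = [expected_q(p, support) for p in probs_per_action]
    return max(range(len(q_values)), key=q_values.__getitem__)

# Assumed, illustrative settings: 51 atoms on [-10, 10].
support = make_support(-10.0, 10.0, 51)

# Action 0: uniform distribution over atoms (all logits equal).
uniform_probs = softmax([0.0] * 51)
# Action 1: all probability mass on the largest return atom.
point_mass = [0.0] * 50 + [1.0]

q_uniform = expected_q(uniform_probs, support)
best = greedy_action([uniform_probs, point_mass], support)
```

Here the uniform distribution has a mean near zero because the support is symmetric, so the greedy choice is the action whose mass sits on the highest atom. A generative approach, as described above, would instead produce samples of the return rather than a fixed probability vector.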
May-25-2018, 15:21:42 GMT