r/MachineLearning - [R] [1911.08265] Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

#artificialintelligence 

Much of it is the same as Value Prediction Networks, which proposes that instead of training a model to minimize L2 prediction-loss, you just train it to get the long-term reward/value right for a start state and a series of actions. That gets around a lot of the difficulty of using MBRL for Atari-like things, where it's very hard to accurately predict next pixels. They pretty much simulate a dense tree to some short depth, assign estimated values to the nodes, and use that for action selection. One is that you're probably simulating a lot of states that your value-function would tell you are DEFINITELY not worthwhile. Atari has 16 actions -- it's unfeasible to simulate more than 3 states deep. And since you're simulating in all directions, but only taking the best (e-greedy) action, you're not going to gather training data on most of the transitions you're estimating.