Accelerating Self-Play Learning in Go
In 2017, DeepMind's AlphaGoZero demonstrated in a landmark result that it was possible to achieve superhuman performance in the game of Go starting from random play, learning purely via reinforcement learning of a neural network bootstrapped from Monte-Carlo tree search self-play [9]. Moreover, AlphaGoZero required only fairly minimal game-specific tuning. Subsequently, DeepMind's AlphaZero demonstrated that the same methods could also train extremely strong agents in Chess and Shogi. However, the amount of computation required was large: DeepMind's main reported run took about 41 TPU-years in total, parallelized over 5000 TPUs [8]. The significant cost of reproducing this work has slowed research, putting it out of reach of all but major companies such as Facebook [11] and a few massively distributed online computation projects, notably Leela Zero for Go [14] and Leela Chess Zero for Chess [17].
In this paper, we introduce several new techniques, while also reviving some ideas from pre-AlphaZero computer Go research and newly applying them to the AlphaZero process. Combined with minor domain-specific heuristic optimizations and overall tuning, these ideas greatly improve the efficiency of self-play learning. Still starting only from random play, and training for about a week on merely about 30 GPUs, our bot KataGo reaches a strength just below that of Leela Zero at the point of its 15-block neural net "LZ130", a level that is likely professional, or possibly just superhuman, when run on strong consumer hardware.
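To make the training scheme concrete, the following is a minimal, self-contained sketch of an AlphaZero-style self-play loop. It is an illustration only: the toy game (NimState), the tabular "net" (TabularNet), and the playout-based search_policy are hypothetical stand-ins invented for this sketch, not KataGo's or AlphaZero's actual code, and it omits the value head, PUCT search, exploration noise, and replay buffer that a real implementation needs.

```python
import math
import random

class NimState:
    """Tiny stand-in for a Go position: players alternately remove 1-3
    stones from a pile, and whoever takes the last stone wins."""
    def __init__(self, stones=15, to_move=0):
        self.stones, self.to_move = stones, to_move

    def legal_moves(self):
        return list(range(1, min(3, self.stones) + 1))

    def play(self, k):
        return NimState(self.stones - k, 1 - self.to_move)

    def is_terminal(self):
        return self.stones == 0

    def outcome(self):
        # Terminal means the previous player took the last stone and won;
        # score is from player 0's perspective.
        return 1.0 if self.to_move == 1 else -1.0

class TabularNet:
    """Stand-in for the policy network: a table of move preferences that
    is nudged toward the search distributions it is trained on."""
    def __init__(self):
        self.prefs = {}  # (stones, move) -> preference weight

    def policy(self, state):
        moves = state.legal_moves()
        weights = [math.exp(self.prefs.get((state.stones, m), 0.0)) for m in moves]
        total = sum(weights)
        return {m: w / total for m, w in zip(moves, weights)}

    def update(self, examples, lr=0.5):
        # Move the policy toward each recorded search distribution.
        for state, dist, _z in examples:
            prior = self.policy(state)
            for m, p in dist.items():
                key = (state.stones, m)
                self.prefs[key] = self.prefs.get(key, 0.0) + lr * (p - prior[m])

def rollout(net, state):
    """Play a game to the end using the net's policy; return the result."""
    while not state.is_terminal():
        dist = net.policy(state)
        moves, probs = zip(*dist.items())
        state = state.play(random.choices(moves, weights=probs)[0])
    return state.outcome()

def search_policy(net, state, playouts=20):
    """Crude stand-in for MCTS: score each move by net-guided playouts and
    return a sharpened, visit-count-like distribution over moves."""
    sign = 1.0 if state.to_move == 0 else -1.0  # convert to mover's perspective
    scores = {}
    for m in state.legal_moves():
        nxt = state.play(m)
        scores[m] = sign * sum(rollout(net, nxt) for _ in range(playouts)) / playouts
    weights = {m: math.exp(4.0 * v) for m, v in scores.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

def self_play_game(net):
    """Play one game against itself, labeling every position with the
    search distribution and, afterward, the final outcome z."""
    state, history = NimState(), []
    while not state.is_terminal():
        dist = search_policy(net, state)
        history.append((state, dist))
        moves, probs = zip(*dist.items())
        state = state.play(random.choices(moves, weights=probs)[0])
    z = state.outcome()
    return [(s, d, z) for s, d in history]

if __name__ == "__main__":
    net = TabularNet()
    for _ in range(200):
        net.update(self_play_game(net))
    # Taking 3 from 15 leaves a multiple of 4, the known winning move in
    # this toy game; after training its probability should typically dominate.
    print(net.policy(NimState(15)))
```

The essential structure mirrors the process the paragraph describes: search produces a stronger move distribution than the raw network, the network is trained toward that distribution, and the improved network in turn makes the search stronger on the next iteration.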