Before AlphaGo there was TD-Gammon -- Jim Fleming


In 1992, Gerald Tesauro published a paper describing TD-Gammon, a neural network trained with reinforcement learning to play backgammon. In backgammon, there are two tracks moving in opposite directions, and players take turns rolling dice to move their checkers from one end of their track to the other, called "home". TD-Gammon consists of a simple three-layer neural network trained using a reinforcement learning technique known as TD-Lambda, or temporal-difference learning with a trace decay parameter lambda (λ). When we backpropagate the final game state, the decaying trace lets us account for the gradients of earlier states in the game without keeping a complete history of gradients.
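To make the trace idea concrete, here is a minimal sketch of TD(λ) with an eligibility trace, using a linear value function V(s) = w · s instead of TD-Gammon's neural network. The function names, the learning-rate/decay values, and the toy episode are all illustrative assumptions, not Tesauro's actual setup; the point is that the trace `e` accumulates decayed gradients of earlier states, so a single TD error can update them all at once.

```python
import numpy as np

def td_lambda_episode(episode, w, alpha=0.1, gamma=1.0, lam=0.7):
    """One episode of TD(lambda) for a linear value function V(s) = w @ s.

    episode: list of (state, next_state, reward) transitions.
    The eligibility trace replaces a full history of per-state gradients.
    """
    e = np.zeros_like(w)                        # eligibility trace
    for s, s_next, reward in episode:
        delta = reward + gamma * (w @ s_next) - (w @ s)  # TD error
        e = gamma * lam * e + s                 # decay old gradients, add grad V(s) = s
        w = w + alpha * delta * e               # credit all recent states at once
    return w

# Toy episode: three one-hot states, then a terminal state with reward 1.0.
s0, s1, s2, terminal = np.eye(4)
episode = [(s0, s1, 0.0), (s1, s2, 0.0), (s2, terminal, 1.0)]
w = td_lambda_episode(episode, np.zeros(4))
```

After this single episode, the final TD error of 1.0 flows back through the trace: the state nearest the reward gets the largest update, and each earlier state gets a smaller one, scaled by λ per step, without ever storing their gradients explicitly.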