Gradient Descent for General Reinforcement Learning
Leemon C. Baird III, Andrew W. Moore
Neural Information Processing Systems
These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MDPs. These include Q-learning, SARSA, and advantage learning. In addition to these value-based algorithms, the approach also generates pure policy-search reinforcement-learning algorithms, which learn optimal policies without learning a value function.
Dec-31-1999