Self-Play Learning Without a Reward Metric

Schmidt, Dan; Moran, Nick; Rosenfeld, Jonathan S.; Rosenthal, Jonathan; Yedidia, Jonathan

Dec-16-2019–arXiv.org Machine Learning

The AlphaZero algorithm for learning strategy games via self-play, which has produced superhuman ability in Go, chess, and shogi, uses a quantitative reward function for game outcomes. This requires users of the algorithm to explicitly balance different components of the reward, such as the game winner and the margin of victory, against each other. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need for any quantitative balancing of reward components. We demonstrate that this system learns optimal play on a sample game in an amount of time comparable to AlphaZero.
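
To make the distinction in the abstract concrete, here is a minimal Python sketch contrasting a hand-weighted scalar reward with a pure total ordering over outcomes. It is not the authors' implementation. The `Outcome` class, its `won` and `margin` fields, the `scalar_reward` function, and its weights are all hypothetical names chosen for illustration, and the assumption that winning always outranks margin is just one possible ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Outcome:
    """Hypothetical game outcome from one player's point of view.

    With order=True, instances compare like the tuple (won, margin):
    any win outranks any loss, and margin only breaks ties.
    """
    won: bool
    margin: int  # points ahead (positive) or behind (negative)


def scalar_reward(outcome, win_weight=1.0, margin_weight=0.1):
    """Quantitative reward: the two weights must be balanced by hand (assumed values)."""
    base = 1.0 if outcome.won else -1.0
    return win_weight * base + margin_weight * outcome.margin


outcomes = [
    Outcome(won=False, margin=-12),
    Outcome(won=True, margin=3),
    Outcome(won=False, margin=-1),
    Outcome(won=True, margin=20),
]

# Ranking needs only pairwise comparisons, with no numeric weights involved.
ranked = sorted(outcomes)
best = max(outcomes)
```

In this sketch, `sorted` and `max` rely only on the comparison operators that `dataclass(order=True)` generates, so changing which outcome is preferred means changing the field order rather than retuning `win_weight` and `margin_weight`.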

Country:
- North America > United States
  - Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Germany
  - Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)

Genre:
- Research Report (0.50)

Industry:
- Leisure & Entertainment > Games (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (0.93)
  - Machine Learning > Reinforcement Learning (0.70)
