ISL: Optimal Policy Learning With Optimal Exploration-Exploitation Trade-Off
arXiv.org Artificial Intelligence
Traditionally, off-policy learning algorithms (such as Q-learning) and exploration schemes have been derived separately, and the exploration-exploitation dilemma is often addressed through heuristics. In this article, we show that both the learning equations and the exploration-exploitation strategy can be derived in tandem as the solution to a unique and well-posed optimization problem whose minimization leads to the optimal value function. We present a new algorithm following this idea. The algorithm is of the gradient type (and therefore has good convergence properties even when used in conjunction with function approximators such as neural networks); it is off-policy; and it specifies both the update equations and the strategy to address the exploration-exploitation dilemma. To the best of our knowledge, this is the first algorithm that has these properties.
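For context, the traditional setup that the abstract contrasts against can be sketched as follows. This is not the ISL algorithm, whose equations are not given here; it is a minimal tabular Q-learning loop with an epsilon-greedy heuristic, written to show that the update rule and the exploration scheme are two independent pieces. The five-state chain environment and all names and hyperparameters (`step`, `train`, `alpha`, `gamma`, `epsilon`) are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Exploration heuristic, chosen independently of the update rule."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # explore: uniform random action
    best = max(q_row)
    # exploit: greedy action, ties broken at random
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def q_learning_update(q, s, a, r, s_next, done, alpha, gamma):
    """Off-policy update: the target uses the max over next actions,
    regardless of which action the behaviour policy takes next."""
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def step(s, a, n_states):
    """Hypothetical chain: action 1 moves right, action 0 moves left.
    Reaching the right-most state gives reward 1 and ends the episode."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done


def train(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = epsilon_greedy(q[s], epsilon, rng)              # exploration scheme
            s_next, r, done = step(s, a, n_states)
            q_learning_update(q, s, a, r, s_next, done, alpha, gamma)  # learning rule
            s = s_next
    return q


if __name__ == "__main__":
    for state, row in enumerate(train()):
        print(state, [round(v, 3) for v in row])
```

Swapping `epsilon_greedy` for any other heuristic leaves `q_learning_update` untouched, which is the separation the abstract describes; the paper's claim is that both pieces can instead be derived from a single optimization problem.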
Sep-13-2019