Context-dependent upper-confidence bounds for directed exploration

Kumaraswamy, Raksha, Schlegel, Matthew, White, Adam, White, Martha

Feb-14-2020, 15:27:38 GMT–Neural Information Processing Systems

Directed exploration strategies for reinforcement learning are critical for learning an optimal policy in a minimal number of interactions with the environment. Many algorithms use optimism to direct exploration, either through visitation estimates or upper confidence bounds, as opposed to data-inefficient strategies like ε-greedy that use random, undirected exploration. Most data-efficient exploration methods require significant computation, typically relying on a learned model to guide exploration. Least-squares methods have the potential to provide some of the data-efficiency benefits of model-based approaches (because they summarize past interactions) with the computation closer to that of model-free approaches. In this work, we provide a novel, computationally efficient, incremental exploration strategy, leveraging this property of least-squares temporal difference learning (LSTD).
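To make concrete how a least-squares method can summarize past interactions at modest cost, the following is a minimal sketch of textbook incremental LSTD(0) for policy evaluation. It is not the exploration strategy proposed in the paper. The class name `IncrementalLSTD`, the regularizer `eta`, and the two-state chain used to exercise it are all assumptions made for illustration.

```python
# Illustrative sketch of standard incremental LSTD(0) with linear features.
# This is NOT the paper's exploration method; names and the toy task are hypothetical.

class IncrementalLSTD:
    def __init__(self, num_features, gamma=0.9, eta=1e-3):
        self.gamma = gamma
        # A starts as eta * I, so its inverse starts as (1 / eta) * I.
        self.a_inv = [[(1.0 / eta if i == j else 0.0) for j in range(num_features)]
                      for i in range(num_features)]
        self.b = [0.0] * num_features

    def update(self, x, reward, x_next):
        """Fold one transition into the summary: A += x (x - gamma x')^T, b += r x."""
        n = len(x)
        v = [x[i] - self.gamma * x_next[i] for i in range(n)]
        # Sherman-Morrison rank-one update keeps the inverse of A without re-inverting.
        a_inv_x = [sum(self.a_inv[i][k] * x[k] for k in range(n)) for i in range(n)]
        v_a_inv = [sum(v[k] * self.a_inv[k][j] for k in range(n)) for j in range(n)]
        denom = 1.0 + sum(v[i] * a_inv_x[i] for i in range(n))
        for i in range(n):
            for j in range(n):
                self.a_inv[i][j] -= a_inv_x[i] * v_a_inv[j] / denom
        for i in range(n):
            self.b[i] += reward * x[i]

    def weights(self):
        """Value-function weights w = A^{-1} b."""
        n = len(self.b)
        return [sum(self.a_inv[i][j] * self.b[j] for j in range(n)) for i in range(n)]


# Hypothetical toy task: state 0 -> state 1 (reward 0), state 1 -> terminal (reward 1).
# One-hot features; the terminal state is represented by the zero vector.
learner = IncrementalLSTD(num_features=2, gamma=0.9)
for _ in range(200):
    learner.update([1.0, 0.0], 0.0, [0.0, 1.0])
    learner.update([0.0, 1.0], 1.0, [0.0, 0.0])

w = learner.weights()
print(w)
```

On this toy chain the weights should approach the true values 0.9 and 1.0. Each update costs time quadratic in the number of features, and the matrix and vector together act as a compact summary of all transitions seen so far.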

context-dependent upper-confidence, exploration, exploration strategy, (4 more...)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)