Bandit Overfitting in Offline Policy Learning

Brandfonbrener, David, Whitney, William F., Ranganath, Rajesh, Bruna, Joan

Nov-10-2020, arXiv.org Machine Learning

We study the offline policy learning problem in a contextual bandit framework. Specifically, we focus on overfitting, which is especially important in the modern setting where overparameterized models can interpolate the data. Our first contribution is a regret decomposition into approximation, estimation, and bandit errors that emphasizes the distinction between policy learning and supervised learning. The bandit error measures the error from overfitting to the single action observed at each context, which we call "bandit overfitting". Our second contribution is to show, both in theory and in experiments, how bandit overfitting differs between policy-based and value-based algorithms when we use overparameterized models. We find that bandit overfitting can become a severe problem for policy-based algorithms, whereas value-based algorithms effectively reduce the policy learning problem to regression and thus avoid the worst problems of bandit overfitting.
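The contrast between policy-based and value-based learning can be illustrated with a small simulation. The sketch below is not the paper's experiment. It assumes a toy problem with a scalar context, three actions, a uniform behavior policy, and strictly positive rewards. It uses an importance-weighted objective as a stand-in for "policy-based" and per-action least-squares regression as a stand-in for "value-based". All names (`SLOPES`, `mean_reward`, `fit_line`, and so on) are hypothetical. Under these assumptions, a policy that is free to interpolate the data maximizes the importance-weighted objective by copying the single logged action at each context, while the regression-based policy generalizes across contexts.

```python
import random

rng = random.Random(0)
N, K = 2000, 3
SLOPES = [-1.0, 0.0, 1.0]  # hypothetical reward slopes, one per action


def mean_reward(x, a):
    """True expected reward (assumed for this toy problem); lies in [1, 3]."""
    return 2.0 + SLOPES[a] * x


# Logged bandit data: one uniformly random action per context.
contexts = [rng.uniform(-1.0, 1.0) for _ in range(N)]
actions = [rng.randrange(K) for _ in range(N)]
rewards = [mean_reward(x, a) + rng.uniform(-0.5, 0.5)
           for x, a in zip(contexts, actions)]

# Policy-based with an interpolating (per-context) policy: the objective
# sum_i r_i * pi(a_i | x_i) / mu(a_i | x_i) is maximized by putting all mass
# on the logged action whenever r_i > 0, which is always the case here.
policy_based = [a if r > 0 else (a + 1) % K for a, r in zip(actions, rewards)]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


# Value-based: regress reward on context separately for each action,
# then act greedily with respect to the fitted values.
q_models = []
for a in range(K):
    xs = [x for x, b in zip(contexts, actions) if b == a]
    ys = [r for r, b in zip(rewards, actions) if b == a]
    q_models.append(fit_line(xs, ys))


def greedy(x):
    return max(range(K), key=lambda a: q_models[a][0] + q_models[a][1] * x)


value_based = [greedy(x) for x in contexts]


def true_value(chosen):
    """Average true expected reward of the chosen actions on the logged contexts."""
    return sum(mean_reward(x, a) for x, a in zip(contexts, chosen)) / N


behavior_value = sum(mean_reward(x, a)
                     for x in contexts for a in range(K)) / (N * K)
policy_value = true_value(policy_based)
q_value = true_value(value_based)

print(f"behavior policy: {behavior_value:.3f}")
print(f"policy-based   : {policy_value:.3f}")
print(f"value-based    : {q_value:.3f}")
```

In this toy setting the policy-based solution reproduces the logged actions, so its value stays close to that of the behavior policy. The regression-based policy recovers which action is better on each side of zero. The gap depends on the assumptions above, in particular on positive rewards and on a well-specified regression model.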

baseline, behavior policy, policy optimization
