Differentiable Meta-Learning in Contextual Bandits

Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvari, Craig Boutilier

arXiv.org Machine Learning 

We study a contextual bandit setting where the learning agent has access to sampled bandit instances from an unknown prior distribution $\mathcal{P}$. The goal of the agent is to achieve high reward on average over the instances drawn from $\mathcal{P}$. This setting is of particular importance because it formalizes the offline optimization of bandit policies to perform well on average over anticipated bandit instances. The main idea in our work is to optimize differentiable bandit policies by policy gradients. We derive reward gradients that reflect the structure of our problem, and propose contextual policies that are parameterized in a differentiable way and have low regret. Our algorithmic and theoretical contributions are supported by extensive experiments that show the importance of baseline subtraction and learned biases, and the practicality of our approach on a range of classification tasks.
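The core mechanics of this approach, estimating the gradient of expected reward with respect to policy parameters over instances sampled from the prior, and subtracting a baseline to reduce gradient variance, can be illustrated with a short sketch. The following is a minimal REINFORCE-style example, not the authors' algorithm from the paper: the Gaussian prior over linear reward parameters, the softmax policy, and all dimensions and step sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Assumed problem sizes for this sketch: K arms, d-dimensional contexts.
K, d, n_rounds = 5, 8, 200

def sample_instance():
    """Draw one bandit instance (true per-arm reward parameters) from a
    stand-in Gaussian prior, playing the role of the unknown prior P."""
    return rng.normal(size=(K, d))

def run_episode(theta, instance):
    """Play one instance with the softmax policy parameterized by theta.
    Returns per-round score-function gradients and observed rewards."""
    grads, rewards = [], []
    for _ in range(n_rounds):
        x = rng.normal(size=d)                         # context
        probs = softmax(theta @ x)                     # policy over arms
        a = rng.choice(K, p=probs)
        r = instance[a] @ x + rng.normal(scale=0.1)    # noisy linear reward
        # grad of log pi(a | x) w.r.t. theta: (e_a - probs) outer x
        g = -np.outer(probs, x)
        g[a] += x
        grads.append(g)
        rewards.append(r)
    return np.array(grads), np.array(rewards)

# Policy-gradient training loop over sampled instances.
theta = np.zeros((K, d))
lr, n_instances = 0.05, 100
for _ in range(n_instances):
    inst = sample_instance()
    grads, rewards = run_episode(theta, inst)
    baseline = rewards.mean()  # baseline subtraction reduces variance
    theta += lr * np.tensordot(rewards - baseline, grads, axes=1) / n_rounds
```

Here the empirical mean reward serves as a simple baseline; the paper's experiments suggest that such baseline subtraction matters in practice, though the specific baseline and policy class studied there may differ from this sketch.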
