Marketers are busy people: they have content to produce, audiences to engage and campaigns to run. They also have to make sure their content and offers are relevant to their visitors and email recipients, ideally at an individual level. But how can this be accomplished at scale? We are excited to introduce "Contextual Bandit," a major enhancement to the machine learning capabilities of the Evergage platform. Contextual Bandit is a sophisticated algorithm that evaluates both the probability of someone engaging with a particular offer and the business value of that offer to the company.
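To make the scoring idea above concrete, here is a minimal sketch that ranks offers by expected value, that is, estimated engagement probability multiplied by business value. It is an illustration only and not Evergage's implementation: the logistic model, the `weights` and `value` fields, and the offer names are all hypothetical, and the exploration step that a real bandit needs is left out.

```python
import math

def engagement_probability(weights, context):
    """Logistic estimate of the chance that a visitor engages with an offer.

    `weights` and `context` are hypothetical feature vectors of equal length.
    """
    z = sum(w * x for w, x in zip(weights, context))
    return 1.0 / (1.0 + math.exp(-z))

def choose_offer(offers, context):
    """Return the name of the offer with the highest expected value."""
    def expected_value(offer):
        # Expected value = P(engage | context) * business value of the offer.
        return engagement_probability(offer["weights"], context) * offer["value"]
    return max(offers, key=expected_value)["name"]

# Hypothetical offers: one is likely to be clicked but worth little,
# the other is less likely to be clicked but worth much more.
offers = [
    {"name": "free_shipping", "weights": [2.0, 0.0], "value": 5.0},
    {"name": "premium_upgrade", "weights": [0.0, 0.0], "value": 40.0},
]
context = [1.0, 0.0]

print(choose_offer(offers, context))  # prints "premium_upgrade"
```

In this example the free-shipping offer has the higher engagement probability (about 0.88 versus 0.5), but the premium upgrade wins on expected value (20.0 versus about 4.4), which is the trade-off the paragraph describes.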

Oetomo, Bastian, Perera, Malinga, Borovica-Gajic, Renata, Rubinstein, Benjamin I. P.

We revisit the proof by Qin et al. (2014) of bounded regret of the C$^2$UCB contextual combinatorial bandit. We demonstrate an error in the proof of volumetric expansion of the moment matrix, used in upper bounding a function of context vector norms. We prove a relaxed inequality that yields the originally-stated regret bound.

Krishnamurthy, Akshay, Wu, Zhiwei Steven, Syrgkanis, Vasilis

This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for an action is modeled as a linear function of known action features confounded by a non-linear action-independent term. We design new algorithms that achieve $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is $d$-dimensional, which matches the best known bounds for the simpler unconfounded case and improves on a recent result of Greenewald et al. (2017). Via an empirical evaluation, we show that our algorithms outperform prior approaches when there are non-linear confounding effects on the rewards. Technically, our algorithms use a new reward estimator inspired by doubly-robust approaches and our proofs require new concentration inequalities for self-normalized martingales.
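One way to write the reward model described in this abstract is sketched below. The notation is an assumption made for illustration and may differ from the paper's: $z_{t,a} \in \mathbb{R}^d$ is the known feature vector of action $a$ at round $t$, $\theta \in \mathbb{R}^d$ is the unknown parameter, $f_t$ is the action-independent confounding term, and $\xi_t$ is zero-mean noise.

$$r_t(a) = \langle \theta, z_{t,a} \rangle + f_t + \xi_t$$

Because $f_t$ does not depend on the action, it cancels when the expected rewards of two actions $a$ and $b$ are compared in the same round, so the comparison is purely linear:

$$\mathbb{E}[r_t(a)] - \mathbb{E}[r_t(b)] = \langle \theta, z_{t,a} - z_{t,b} \rangle$$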

Syrgkanis, Vasilis, Luo, Haipeng, Krishnamurthy, Akshay, Schapire, Robert E.

We propose a new oracle-based algorithm, BISTRO+, for the adversarial contextual bandit problem, where either contexts are drawn i.i.d. or the sequence of contexts is known a priori, but where the losses are picked adversarially. Our algorithm is computationally efficient, assuming access to an offline optimization oracle, and enjoys a regret of order $O((KT)^{\frac{2}{3}}(\log N)^{\frac{1}{3}})$, where $K$ is the number of actions, $T$ is the number of iterations, and $N$ is the number of baseline policies. Our result is the first to break the $O(T^{\frac{3}{4}})$ barrier achieved by recent algorithms, which was left as a major open problem. Our analysis employs the recent relaxation framework of (Rakhlin and Sridharan, ICML'16).

Joseph, Matthew, Kearns, Michael, Morgenstern, Jamie, Roth, Aaron

We introduce the study of fairness in multi-armed bandit problems. Our fairness definition can be interpreted as demanding that given a pool of applicants (say, for college admission or mortgages), a worse applicant is never favored over a better one, despite a learning algorithm's uncertainty over the true payoffs. We prove results of two types. First, in the important special case of the classic stochastic bandits problem (i.e., in which there are no contexts), we provide a provably fair algorithm based on "chained" confidence intervals, and provide a cumulative regret bound with a cubic dependence on the number of arms. We further show that any fair algorithm must have such a dependence. When combined with regret bounds for standard non-fair algorithms such as UCB, this proves a strong separation between fair and unfair learning, which extends to the general contextual case. In the general contextual case, we prove a tight connection between fairness and the KWIK (Knows What It Knows) learning model: a KWIK algorithm for a class of functions can be transformed into a provably fair contextual bandit algorithm, and conversely any fair contextual bandit algorithm can be transformed into a KWIK learning algorithm. This tight connection allows us to provide a provably fair algorithm for the linear contextual bandit problem with a polynomial dependence on the dimension, and to show (for a different class of functions) a worst-case exponential gap in regret between fair and non-fair learning algorithms.
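The "chained" confidence intervals mentioned in this abstract can be illustrated with a short sketch. It assumes one plausible reading of the idea: an arm is treated the same as the apparent best arm whenever its interval is linked to the best arm's interval through a chain of overlapping intervals, and the learner then picks uniformly among those arms. The function names and the example intervals are hypothetical, and the sketch does not show how the paper constructs or updates the intervals.

```python
import random

def chained_to_top(intervals):
    """Return sorted indices of arms linked to the arm with the highest upper
    bound, directly or through a chain of overlapping (lower, upper) intervals."""
    top = max(range(len(intervals)), key=lambda i: intervals[i][1])
    chained = {top}
    frontier = [top]
    while frontier:
        i = frontier.pop()
        lo_i, hi_i = intervals[i]
        for j, (lo_j, hi_j) in enumerate(intervals):
            # Two closed intervals overlap when each starts before the other ends.
            if j not in chained and lo_j <= hi_i and lo_i <= hi_j:
                chained.add(j)
                frontier.append(j)
    return sorted(chained)

def fair_choice(intervals, rng=random):
    """Pick uniformly at random among the arms chained to the top arm."""
    return rng.choice(chained_to_top(intervals))

# Hypothetical confidence intervals for four arms.
example = [(0.6, 0.9), (0.5, 0.7), (0.3, 0.55), (0.0, 0.2)]
print(chained_to_top(example))  # prints [0, 1, 2]
```

Arm 2 does not overlap the top arm directly, but it overlaps arm 1, which does, so the data cannot yet rule out that arm 2 is as good as arm 0. Arm 3 is separated from the whole chain and is excluded.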