From Contextual Combinatorial Semi-Bandits to Bandit List Classification: Improved Sample Complexity with Sparse Rewards
–Neural Information Processing Systems
We study the problem of contextual combinatorial semi-bandits, where input contexts are mapped into subsets of size $m$ of a collection of $K$ possible actions. In each round of the interaction, the learner observes feedback consisting of the realized reward of the predicted actions. Motivated by prototypical applications of contextual bandits, we focus on the $s$-sparse regime where we assume that the sum of rewards is bounded by some value $s \ll K$. For example, in recommendation systems the number of products purchased by any customer is significantly smaller than the total number of available products. Our main result is for the $(\varepsilon,\delta)$-PAC variant of the problem for which we design an algorithm that returns an $\varepsilon$-optimal policy with high probability using a sample complexity of $\widetilde{O}\big( (\mathrm{poly}(K/m) + sm / \varepsilon^2) \log (|\Pi|/\delta) \big)$ where $\Pi$ is the underlying (finite) class and $s$ is the sparsity parameter.
Neural Information Processing Systems
Jun-14-2026, 06:02:37 GMT
- Technology: