Bandits with Single-Peaked Preferences and Limited Resources

Keinan, Gur, Torkan, Rotem, Ben-Porat, Omer

arXiv.org Artificial Intelligence 

Modern recommendation systems often face the challenge of personalization at scale--learning individual user preferences while simultaneously satisfying global resource allocation constraints. To illustrate, consider a content platform that must decide which content creators to commission daily, where each creator has a different cost and produces ephemeral content on specific topics. Each user has preferences over all creators' content styles and topics. After commissioning a subset of creators that fit the platform's budget, it matches each user to content from one of these creators, where the same creator's content can be recommended to multiple users. The challenge lies in learning individual user preferences for each creator's content while selecting which creators to commission and how to assign their content to maximize user satisfaction. This problem fits the combinatorial multi-armed bandit framework, where the decision-maker must choose structured action sets [8], such as assigning each user to an item. The goal is to maximize cumulative reward, or equivalently, minimize regret by balancing exploration and exploitation. Unfortunately, combinatorial problems like the one in the example above are NP-complete even for offline settings. Therefore, traditional approaches settle for weaker notions of α-regret [8], competing against the best ef-All authors contributed equally to this work.