Leveraging User-Triggered Supervision in Contextual Bandits
Alekh Agarwal, Claudio Gentile, Teodor V. Marinov
arXiv.org Artificial Intelligence
We study contextual bandit (CB) problems, where the user can sometimes respond with the best action in a given context. Such an interaction arises, for example, in text prediction or autocompletion settings, where a poor suggestion is simply ignored and the user enters the desired text instead. Crucially, this extra feedback is user-triggered on only a subset of the contexts. We develop a new framework to leverage such signals, ...

How should we leverage such an extra modality of feedback along with the typical reward signal in CBs? While prior works have developed hybrid models such as learning with feedback graphs (e.g., Mannor & Shamir, 2011; Caron et al., 2012; Alon et al., 2017) to capture a continuum between supervised and CB learning, such settings are not a natural fit here. A key challenge in the feedback structure is that the extra supervised signal is only available on a subset of the contexts, which are chosen by the user as some unknown function of the algorithm's recommended ...
Feb-7-2023