Reviews: Recovering Bandits
Neural Information Processing Systems
The major point that needs to be clarified for me is the distinction between single play regret and multiple play regret. First, from my understanding, the single play regret would be defined only for strategies which select each arm at most once during d steps (and therefore d cannot be larger than K in this case, right?). The UCB-Z algorithm is not of this type, for example. It is a bit weird to have a regret notion that only applies to very specific algorithms. Second, regarding the distinction between single and multiple play regret, am I right that E[R_T {(d,m)}] E[R_T {(d)}]? That is, you would in principle care about the multiple play lookahead regret, as one should be allowed to play an arm multiple times during the d time steps over which we optimize.
Jan-25-2025, 23:18:58 GMT