ar-ucb
Autoregressive Bandits
Bacchiocchi, Francesco, Genalti, Gianmarco, Maran, Davide, Mussi, Marco, Restelli, Marcello, Gatti, Nicola, Metelli, Alberto Maria
Autoregressive processes naturally arise in a large variety of real-world scenarios, including, e.g., stock markets, sales forecasting, weather prediction, advertising, and pricing. When addressing a sequential decision-making problem in such a context, the temporal dependence between consecutive observations should be properly accounted for to converge to the optimal decision policy. In this work, we propose a novel online learning setting, named Autoregressive Bandits (ARBs), in which the observed reward follows an autoregressive process of order k, whose parameters depend on the action the agent chooses. In this paper, we model the reward of a sequential decision-making problem as an AR process where its parameters depend on the action selected by the agent at every round. This scenario can be regarded as an extension of the multi-armed bandit (MAB, Lattimore & Szepesvári, 2020) problem, in which an AR process governs the temporal structure of the observed rewards through the action-dependent AR parameters, which are unknown to the agent. It is worth mentioning that such a scenario displays notable differences compared to more traditional non-stationary MABs (Gur et al., 2014). Indeed, in the presented scenario, we can exploit the knowledge that the underlying process is AR and, more importantly, that such a dynamic depends on the agent's action.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Italy > Lombardy > Milan (0.04)
- Asia > Middle East > Jordan (0.04)
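The ARB reward model described above can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: the arm count, AR order k = 2, the parameter values in `params`, the noise scale, and the uniform placeholder policy are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: 2 actions, AR order k = 2. Each action a has its own
# parameters (gamma_0, gamma_1, gamma_2), and the reward evolves as
#   x_t = gamma_0(a_t) + gamma_1(a_t) * x_{t-1} + gamma_2(a_t) * x_{t-2} + noise.
params = {
    0: np.array([0.5, 0.4, 0.1]),  # action 0: (gamma_0, gamma_1, gamma_2)
    1: np.array([0.3, 0.2, 0.3]),  # action 1
}
k = 2

def step(history, action, noise_scale=0.01):
    """Next reward from the last k rewards and the chosen action."""
    g = params[action]
    # history holds [x_{t-2}, x_{t-1}]; reverse so gamma_1 meets x_{t-1}.
    return g[0] + g[1:] @ history[::-1] + noise_scale * rng.standard_normal()

history = np.zeros(k)  # initial rewards x_{-1} = x_{-2} = 0
rewards = []
for t in range(100):
    a = rng.integers(2)                  # uniform placeholder policy
    x = step(history, a)
    rewards.append(x)
    history = np.append(history[1:], x)  # slide the length-k AR window
```

Because the AR coefficients of each action sum to less than one, the process is stable and rewards stay bounded; an agent that learns the action-dependent parameters can exploit exactly this temporal structure.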
Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback
Wang, Siwei, Wang, Haoyun, Huang, Longbo
We study the multi-armed bandit (MAB) problem with composite and anonymous feedback. In this model, the reward of pulling an arm spreads over a period of time (we call this period the reward interval), and the player successively receives partial rewards of the action, convoluted with rewards from pulling other arms. Existing results on this model require prior knowledge about the reward interval size as an input to their algorithms. In this paper, we propose adaptive algorithms for both the stochastic and the adversarial cases, without requiring any prior information about the reward interval. For the stochastic case, we prove that our algorithm guarantees a regret that matches the lower bounds (in order). For the adversarial case, we propose the first algorithm to jointly handle a non-oblivious adversary and an unknown reward interval size. We also conduct simulations based on a real-world dataset. The results show that our algorithms outperform existing benchmarks.
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.84)
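The composite and anonymous feedback model can be sketched as a short simulation. This is an illustrative toy, not the paper's algorithm: the interval size `d`, the Bernoulli arm means, and the uniform split of each reward over the interval are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 3               # reward-interval size (unknown to the learner in the paper)
T = 50              # number of rounds
means = [0.2, 0.7]  # toy Bernoulli mean rewards per arm

# pending[t] accumulates all reward pieces that arrive at round t.
pending = np.zeros(T + d)
observed = []

for t in range(T):
    arm = rng.integers(len(means))
    r = rng.binomial(1, means[arm])  # full reward generated by this pull
    # Composite: the reward spreads over the next d rounds
    # (here split uniformly, one of many possible spread patterns).
    for j in range(d):
        pending[t + j] += r / d
    # Anonymous: the learner observes only the aggregate at round t,
    # with no attribution to which past pulls produced each piece.
    observed.append(pending[t])
```

The learner never sees `r` or `arm`-level attribution, only the convoluted stream `observed`, which is what makes credit assignment hard and why existing algorithms needed `d` as an input.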