Optimal Data Driven Resource Allocation under Multi-Armed Bandit Observations
Burnetas, Apostolos N., Kanavetas, Odysseas, Katehakis, Michael N.
Consider the problem of sequentially activating one of a finite number of independent bandits, where each activation of a bandit incurs a number of bandit-dependent resource utilizations, or activation costs. For each resource type, a constraint ensures that the total resource utilized (or equivalently, the cost incurred) at any time does not exceed the current resource availability (budget). It is assumed that following each activation, any unused resource amounts can be carried forward for use in future activations. We also assume that successive activations of each bandit yield independent and identically distributed (iid) random rewards, independent across bandits, with positive means and distributions that depend on unknown parameters. The objective is to obtain a feasible policy that asymptotically maximizes the total expected reward, or equivalently, asymptotically minimizes a regret function. We develop a class of feasible policies that are shown to be asymptotically optimal within the large class of uniformly fast (UF) convergent policies, in the sense of Burnetas and Katehakis (1996) and Lai and Robbins (1985). The results in this paper extend the work in Burnetas et al. (2017), which solved the case where there exists only one type of constraint for all bandits.
Dec-13-2018