Thompson Sampling for Noncompliant Bandits

arXiv.org Machine Learning

Thompson sampling, a Bayesian method for balancing exploration and exploitation in bandit problems, has theoretical guarantees and exhibits strong empirical performance in many domains. Traditional Thompson sampling, however, assumes perfect compliance, where an agent's chosen action is treated as the implemented action. This article introduces a stochastic noncompliance model that relaxes this assumption. We prove that any noncompliance in a 2-armed Bernoulli bandit increases existing regret bounds. With our noncompliance model, we derive Thompson sampling variants that explicitly handle both observed and latent noncompliance. With extensive empirical analysis, we demonstrate that our algorithms either match or outperform traditional Thompson sampling in both compliant and noncompliant environments.
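As a rough illustration of the setting described above, the following is a minimal sketch of Beta-Bernoulli Thompson sampling in a simulated noncompliant environment. It is not the authors' algorithm. The function name `thompson_sampling`, the `comply_prob` parameter, and the rule that a noncompliant round implements a uniformly random other arm are all assumptions made for this example. The sketch also assumes noncompliance is observed, so the posterior of the arm actually implemented is the one updated.

```python
import random


def thompson_sampling(true_means, horizon, comply_prob=1.0, seed=0):
    """Beta-Bernoulli Thompson sampling with simulated, observed noncompliance.

    Assumption (not from the abstract): with probability `comply_prob` the
    chosen arm is implemented; otherwise a uniformly random other arm is.
    """
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # Beta(1, 1) prior: one pseudo-success per arm
    beta = [1.0] * k   # and one pseudo-failure per arm
    best = max(true_means)
    regret = 0.0

    for _ in range(horizon):
        # Draw one sample from each arm's posterior and pick the largest.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        chosen = max(range(k), key=samples.__getitem__)

        # Simulated noncompliance: the implemented arm may differ.
        if rng.random() < comply_prob:
            implemented = chosen
        else:
            implemented = rng.choice([i for i in range(k) if i != chosen])

        reward = 1 if rng.random() < true_means[implemented] else 0

        # Observed noncompliance: credit the arm that was actually played.
        alpha[implemented] += reward
        beta[implemented] += 1 - reward
        regret += best - true_means[implemented]

    return regret, alpha, beta


if __name__ == "__main__":
    for p in (1.0, 0.8):
        regret, _, _ = thompson_sampling([0.2, 0.8], 2000, comply_prob=p)
        print(f"comply_prob={p}: cumulative regret {regret:.1f}")
```

Setting `comply_prob=1.0` recovers the traditional, fully compliant case. Lowering it lets one compare cumulative regret across compliance levels in this toy model.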



Regret Analysis of the Finite-Horizon Gittins Index Strategy for Multi-Armed Bandits

arXiv.org Machine Learning

I analyse the frequentist regret of the famous Gittins index strategy for multi-armed bandits with Gaussian noise and a finite horizon. Remarkably, it turns out that this approach leads to finite-time regret guarantees comparable to those available for the popular UCB algorithm. Along the way I derive finite-time bounds on the Gittins index that are asymptotically exact and may be of independent interest. I also discuss some computational issues and present experimental results suggesting that a particular version of the Gittins index strategy is a modest improvement on existing algorithms with finite-time regret guarantees such as UCB and Thompson sampling.
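For context on the UCB baseline mentioned above, here is a minimal sketch of a UCB-style index policy for arms with unit-variance Gaussian noise. It illustrates the comparison algorithm only, not the Gittins index strategy analysed in the paper. The function name `ucb_gaussian` and the exploration bonus sqrt(2 log t / n) are assumptions chosen for this example.

```python
import math
import random


def ucb_gaussian(true_means, horizon, seed=0):
    """UCB-style index policy for unit-variance Gaussian rewards (sketch)."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # sum of observed rewards per arm
    best = max(true_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            # Pull every arm once so each empirical mean is defined.
            arm = t - 1
        else:
            # Index = empirical mean + exploration bonus.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - true_means[arm]

    return regret, counts


if __name__ == "__main__":
    regret, counts = ucb_gaussian([0.0, 1.0], 2000)
    print(f"cumulative regret {regret:.1f}, pulls per arm {counts}")
```

An index policy of this shape pulls the arm with the largest index each round. A Gittins-style strategy would keep the same loop and substitute a different index computation.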


Bandit Learning Through Biased Maximum Likelihood Estimation

arXiv.org Machine Learning

We propose BMLE, a new family of bandit algorithms formulated in a general way from the Biased Maximum Likelihood Estimation method, which originally appeared in the adaptive control literature. We design the cost-bias term to tackle the exploration-exploitation tradeoff in stochastic bandit problems. We provide an explicit closed-form expression for the index of an arm for Bernoulli bandits, which is trivial to compute. We also provide a general recipe for extending the BMLE algorithm to other families of reward distributions. We prove that for Bernoulli bandits, the BMLE algorithm achieves a logarithmic finite-time regret bound and hence attains order-optimality. Through extensive simulations, we demonstrate that the proposed algorithms achieve regret performance comparable to the best of several state-of-the-art baseline methods, while having a significant computational advantage over the other best-performing methods. The generality of the proposed approach makes it possible to address more complex models, including general adaptive control of Markovian systems.


Statistical Anomaly Detection for Train Fleets

AAAI Conferences

We have developed a method for statistical anomaly detection which has been deployed in a tool for condition monitoring of train fleets. The tool is currently used by several railway operators around the world to inspect and visualize the occurrence of event messages generated on the trains. The anomaly detection component helps the operators to quickly find significant deviations from normal behavior and to detect early indications of possible problems. The savings in maintenance costs come mainly from avoiding costly breakdowns, and have been estimated at several million euros per year for the tool. In the long run, it is expected that maintenance costs can be reduced by between 5 and 10% by using the tool.