Collaborating Authors

Krause, Andreas

Stochastic Linear Bandits Robust to Adversarial Attacks Machine Learning

Over the past few years, bandit algorithms have found application in computational advertising, recommender systems, clinical trials, and many more domains. These algorithms make online decisions by balancing exploitation of previously high-reward actions against exploration of lesser-known ones that could potentially lead to higher rewards. Bandit problems can roughly be categorized [17] into stochastic bandits, in which subsequently played actions yield independent rewards, and adversarial bandits, where the rewards are chosen by an adversary, possibly subject to constraints. A recent line of work has sought to reap the benefits of both approaches by studying bandit problems that are stochastic in nature, but with rewards subject to a limited amount of adversarial corruption. Various works have developed provably robust algorithms [4, 11, 20, 23], and attacks have been designed that cause standard algorithms to fail [9, 11, 12, 21]. While near-optimal theoretical guarantees have been established in the case of independent arms [11], more general settings remain relatively poorly understood or even entirely unexplored; see Section 1.2 for details. Our primary goal is to bridge these gaps via a detailed study of stochastic linear bandits with adversarial corruptions. In the case of a fixed finite (but possibly very large) set of arms, we develop an elimination-based robust algorithm and provide regret bounds with a near-optimal joint dependence on the time horizon and the adversarial attack budget, demonstrating distinct behavior depending on whether the attack budget is known or unknown. In addition, we introduce a novel contextual linear bandit setting under adversarial corruptions, and show that under a context diversity assumption, a simple greedy algorithm attains near-optimal regret despite having no built-in mechanism that explicitly encourages exploration or robustness.
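A minimal sketch of one elimination round in this spirit (illustrative only, not the paper's algorithm; the ridge regularizer, the confidence radius, and the inflation rule `corruption_budget / len(y)` are assumptions chosen for illustration — inflating the confidence width by the per-sample attack budget is the hedge that keeps an adversary with budget C from eliminating the best arm):

```python
import numpy as np

def robust_eliminate(arms, X, y, radius, corruption_budget):
    """One elimination round for linear bandits (illustrative sketch).

    arms: (K, d) candidate feature vectors.
    X, y: features and (possibly corrupted) rewards observed this phase.
    Keeps every arm whose upper confidence bound reaches the best
    lower confidence bound, with widths inflated by the attack budget.
    """
    d = X.shape[1]
    V = X.T @ X + np.eye(d)                # regularized design matrix
    theta = np.linalg.solve(V, X.T @ y)    # ridge estimate of the parameter
    V_inv = np.linalg.inv(V)
    # per-arm confidence width, inflated by the per-sample corruption
    width = radius * np.sqrt(np.einsum('kd,de,ke->k', arms, V_inv, arms))
    width += corruption_budget / len(y)
    est = arms @ theta
    best_lcb = np.max(est - width)
    return arms[est + width >= best_lcb]   # surviving arms
```

With uncorrupted data and enough samples, the optimal arm survives while clearly suboptimal arms are dropped.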

Gradient Estimation with Stochastic Softmax Tricks Machine Learning

The Gumbel-Max trick is the basis of many relaxed gradient estimators. These estimators are easy to implement and have low variance, but the goal of scaling them to large combinatorial distributions is still outstanding. Working within the perturbation model framework, we introduce stochastic softmax tricks, which generalize the Gumbel-Softmax trick to combinatorial spaces. Our framework offers a unified perspective on existing relaxed estimators for perturbation models, and it contains many novel relaxations. We design structured relaxations for subset selection, spanning trees, arborescences, and other structures. Compared to less structured baselines, we find that stochastic softmax tricks can be used to train latent variable models that perform better and discover more latent structure.
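The base construction the paper generalizes can be sketched for the simple categorical case (standard Gumbel-Max and Gumbel-Softmax with temperature tau; the combinatorial extensions to trees and subsets are the paper's contribution and are not shown here):

```python
import numpy as np

def sample_gumbel(shape, rng):
    # Gumbel(0, 1) samples via inverse-CDF: -log(-log(U))
    u = rng.uniform(1e-12, 1.0, size=shape)
    return -np.log(-np.log(u))

def gumbel_max(logits, rng):
    # Exact categorical sample: argmax of Gumbel-perturbed logits
    return int(np.argmax(logits + sample_gumbel(logits.shape, rng)))

def gumbel_softmax(logits, tau, rng):
    # Relaxed, differentiable surrogate: softmax of the same
    # perturbed logits; tau -> 0 recovers the hard one-hot sample
    g = logits + sample_gumbel(logits.shape, rng)
    z = np.exp((g - g.max()) / tau)
    return z / z.sum()
```

The relaxed sample lives on the probability simplex, which is what makes gradients flow through it.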

Coresets via Bilevel Optimization for Continual Learning and Streaming Machine Learning

Coresets are small data summaries that are sufficient for model training. They can be maintained online, enabling efficient handling of large data streams under resource constraints. However, existing constructions are limited to simple models such as k-means and logistic regression. In this work, we propose a novel coreset construction via cardinality-constrained bilevel optimization. We show how our framework can efficiently generate coresets for deep neural networks, and demonstrate its empirical benefits in continual learning and in streaming settings.
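The bilevel structure can be sketched with a logistic-regression stand-in for the deep network (a minimal sketch under assumed details: the cardinality constraint would be enforced by the outer problem allowing only k nonzero weights, e.g. chosen greedily; learning rate and step counts are arbitrary):

```python
import numpy as np

def inner_fit(X, y, w, lr=0.5, steps=300):
    """Inner problem: fit a model on the weighted summary defined by
    the coreset weights w (logistic regression as a stand-in)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        theta -= lr * X.T @ (w * (p - y)) / max(w.sum(), 1.0)
    return theta

def outer_loss(X, y, theta):
    """Outer objective: how well the coreset-trained model explains
    the FULL dataset (cross-entropy over all points)."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```

The outer problem then searches over which k points get nonzero weight so that `outer_loss` on all data is minimized.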

Online Active Model Selection for Pre-trained Classifiers Machine Learning

Model selection from a set of pre-trained models is an emerging problem in machine learning with implications in several practical scenarios. Industrial examples include cases in which a telecommunication company or a flight-booking company has multiple ML models trained over different sliding windows of data and hopes to pick the one that performs best on a given day. For many real-world problems, unlabeled data is abundant and can be inexpensively collected, while labels are expensive to acquire and require human expertise. Consequently, there is a need to robustly identify the best model under limited labeling resources. Similarly, one often needs reasonable predictions for the unlabeled data while keeping the labeling budget low. Depending on data availability, one can consider two settings: the pool-based setting assumes that the learner has access to a pool of unlabeled data, and she tries to select informative data samples from the pool to achieve her task. The online (streaming) setting works with a stream of data that arrives one sample at a time, and the learner decides on the fly whether to ask for the label of each sample or to discard it. While offering fewer options on which data to label next, the streaming setting alleviates the scalability challenge of storing and processing a large pool of examples in the pool-based setting. Another important aspect is the nature of the data: the instance/label pairs might be sampled i.i.d.
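A toy streaming loop in this spirit (an illustrative sketch, not the paper's query rule: here labels are requested mainly when the candidate models disagree, and the query probabilities 1.0 / 0.05 are arbitrary assumptions):

```python
import numpy as np

def stream_select(models, stream, budget, rng):
    """Streaming active model selection (illustrative sketch).

    models: list of predictors x -> label.
    stream: iterable of (x, y) pairs; y is only 'revealed' if queried.
    Queries a label with high probability on disagreement, tracks each
    model's observed mistakes, and returns the index of the best model.
    """
    mistakes = np.zeros(len(models))
    queried = 0
    for x, y in stream:
        preds = np.array([m(x) for m in models])
        # query eagerly when models disagree, rarely otherwise
        p = 1.0 if len(set(preds)) > 1 else 0.05
        if queried < budget and rng.random() < p:
            queried += 1
            mistakes += (preds != y)
        # unqueried samples are simply discarded, as in the streaming setting
    return int(np.argmin(mistakes))
```

Spending the label budget on disagreement points is what lets a small budget separate the models.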

Logistic $Q$-Learning Artificial Intelligence

We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs. The method is closely related to the classic Relative Entropy Policy Search (REPS) algorithm of Peters et al. (2010), with the key difference that our method introduces a Q-function that enables an efficient, exact, model-free implementation; we call the resulting algorithm Q-REPS. While REPS is elegantly derived from a principled linear-programming (LP) formulation of optimal control in MDPs, it has the serious shortcoming that its faithful implementation requires access to the true MDP for both the policy evaluation and improvement steps, even at deployment time. The usual way to address this limitation is to use an empirical approximation to the policy evaluation step and to project the policy from the improvement step into a parametric space (Deisenroth et al., 2013), losing all the theoretical guarantees of REPS in the process.

Semi-supervised Batch Active Learning via Bilevel Optimization Machine Learning

Active learning is an effective technique for reducing labeling cost by improving data efficiency. In this work, we propose a novel batch acquisition strategy for active learning in the setting where model training is performed in a semi-supervised manner. We formulate our approach as a data summarization problem via bilevel optimization, where the queried batch consists of the points that best summarize the unlabeled data pool. We show that our method is highly effective in keyword detection tasks in the regime where only a few labeled samples are available.

Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient Estimator Machine Learning

Gradient estimation in models with discrete latent variables is a challenging problem, because the simplest unbiased estimators tend to have high variance. To counteract this, modern estimators either introduce bias, rely on multiple function evaluations, or use learned, input-dependent baselines. Thus, there is a need for estimators that require minimal tuning, are computationally cheap, and have low mean squared error. In this paper, we show that the variance of the straight-through variant of the popular Gumbel-Softmax estimator can be reduced through Rao-Blackwellization without increasing the number of function evaluations. This provably reduces the mean squared error. We empirically demonstrate that this leads to variance reduction, faster convergence, and generally improved performance in two unsupervised latent variable models.
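The idea can be sketched with a Monte Carlo stand-in for the conditioning step (the paper's estimator takes the conditional expectation given the hard sample; here that expectation is approximated by averaging relaxed samples that all share the same hard sample, drawn with the standard truncated-Gumbel construction — a sketch, not the paper's exact estimator):

```python
import numpy as np

def truncated_gumbel(phi, bound, rng):
    # Gumbel with location phi, truncated to be at most `bound`
    u = rng.uniform(1e-12, 1.0)
    return phi - np.log(np.exp(phi - bound) - np.log(u))

def conditional_relaxation(logits, k, tau, rng):
    """Relaxed (Gumbel-Softmax) sample whose hard sample is fixed to k."""
    log_z = np.logaddexp.reduce(logits)
    g = np.empty_like(logits)
    # the max of the perturbed logits is Gumbel with location log Z
    g[k] = log_z - np.log(-np.log(rng.uniform(1e-12, 1.0)))
    for i in range(len(logits)):
        if i != k:
            g[i] = truncated_gumbel(logits[i], g[k], rng)
    e = np.exp((g - g.max()) / tau)
    return e / e.sum()

def rao_blackwell_st_gs(logits, k, tau, rng, m=16):
    # Average m relaxed samples sharing the same hard sample k:
    # same conditional mean, strictly lower variance
    return np.mean([conditional_relaxation(logits, k, tau, rng)
                    for _ in range(m)], axis=0)
```

Every averaged relaxation keeps its mass peaked on the fixed arm k, so the discrete behavior of the estimator is unchanged while its noise shrinks.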

Learning Set Functions that are Sparse in Non-Orthogonal Fourier Bases Artificial Intelligence

Many applications of machine learning on discrete domains, such as learning preference functions in recommender systems or auctions, can be reduced to estimating a set function that is sparse in the Fourier domain. In this work, we present a new family of algorithms for learning Fourier-sparse set functions. They require at most $nk - k \log_2 k + k$ queries (set function evaluations), under mild conditions on the Fourier coefficients, where $n$ is the size of the ground set and $k$ the number of non-zero Fourier coefficients. In contrast to other work that focused on the orthogonal Walsh-Hadamard transform, our novel algorithms operate with recently introduced non-orthogonal Fourier transforms that offer different notions of Fourier-sparsity. These naturally arise when modeling, e.g., sets of items forming substitutes and complements. We demonstrate their effectiveness on several real-world applications.
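For context, the classical orthogonal transform that this work contrasts with can be sketched as follows (a standard fast Walsh-Hadamard transform over a set function stored as a length-2^n table; the paper's non-orthogonal transforms and sparse-recovery algorithms are not shown):

```python
import numpy as np

def fwht(f):
    """Fast Walsh-Hadamard transform of a set function.

    f: length-2^n table indexed by subsets (bitmasks) of the ground set.
    Returns the 2^n (normalized) Fourier coefficients in O(N log N).
    """
    a = np.asarray(f, dtype=float).copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y   # butterfly step
        h *= 2
    return a / len(a)
```

A set function that is a constant, for instance, is 1-sparse in this basis: its entire spectrum sits on the empty-set coefficient.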

Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning Machine Learning

Model-based reinforcement learning algorithms with probabilistic dynamical models are amongst the most data-efficient learning methods. This is often attributed to their ability to distinguish between epistemic and aleatoric uncertainty. However, while most algorithms distinguish these two uncertainties when {\em learning} the model, they ignore the distinction when {\em optimizing} the policy. In this paper, we show that ignoring the epistemic uncertainty leads to greedy algorithms that do not explore sufficiently. In turn, we propose a {\em practical optimistic-exploration algorithm} (\alg), which enlarges the input space with {\em hallucinated} inputs that can exert as much control as the {\em epistemic} uncertainty in the model affords. We analyze this setting and construct a general regret bound for well-calibrated models, which is provably sublinear in the case of Gaussian Process models. Based on this theoretical foundation, we show how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms and different probabilistic models. Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions, a setting that is notoriously difficult for existing model-based reinforcement learning algorithms.
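The hallucinated-input idea can be sketched in one transition function (a minimal sketch under assumed interfaces: `mean_fn` and `std_fn` stand for the learned model's predictive mean and epistemic standard deviation, and `beta` is an assumed calibration scaling):

```python
import numpy as np

def hallucinated_step(mean_fn, std_fn, state, action, eta, beta=2.0):
    """One transition of the optimistic 'hallucinated' dynamics (sketch).

    The policy outputs an extra control eta in [-1, 1]^d alongside the
    real action; eta steers the imagined next state anywhere inside the
    model's epistemic confidence set mean(s, a) +/- beta * std(s, a),
    which is exactly as much control as the epistemic uncertainty affords.
    """
    eta = np.clip(eta, -1.0, 1.0)
    return mean_fn(state, action) + beta * std_fn(state, action) * eta
```

Optimizing the policy jointly over actions and eta then implements optimism: the agent plans as if uncertainty could resolve in its favor.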

Learning to Play Sequential Games versus Unknown Opponents Artificial Intelligence

We consider a repeated sequential game between a learner, who plays first, and an opponent who responds to the chosen action. We seek to design strategies for the learner to successfully interact with the opponent. While most previous approaches consider known opponent models, we focus on the setting in which the opponent's model is unknown. To this end, we use kernel-based regularity assumptions to capture and exploit the structure in the opponent's response. We propose a novel algorithm for the learner when playing against an adversarial sequence of opponents. The algorithm combines ideas from bilevel optimization and online learning to effectively balance between exploration (learning about the opponent's model) and exploitation (selecting highly rewarding actions for the learner). Our results include regret guarantees for the algorithm that depend on the regularity of the opponent's response and scale sublinearly with the number of game rounds. Moreover, we specialize our approach to repeated Stackelberg games, and empirically demonstrate its effectiveness in traffic routing and wildlife conservation tasks.