Online optimization of piecewise Lipschitz functions in changing environments
Balcan, Maria-Florina, Dick, Travis, Sharma, Dravyansh
In an online optimization problem we are required to choose a sequence of points from a fixed feasible subset of $\mathbb{R}^d$. After each point is chosen, a function over the set is revealed and we accumulate payoff equal to the function value at the chosen point. We consider the class of piecewise Lipschitz functions, the most general setting considered in the literature for this problem, which arises naturally in combinatorial algorithm selection problems where utility functions can have sharp discontinuities. The usual performance metric of "static" regret measures the gap between the accumulated payoff and that of the best fixed point over the entire duration, and thus fails to capture changing environments. Shifting regret is a useful alternative, which allows for up to $s$ environment shifts. In this work we provide an $O(\sqrt{sdT\log T}+sT^{1-\beta})$ regret bound for $\beta$-dispersed functions, where $\beta$ roughly quantifies the rate at which discontinuities appear in the utility functions in expectation (typically $\beta\ge 1/2$ in problems of practical interest). We show this bound is optimal up to sub-logarithmic factors. We further show how to improve the bounds when selecting from a small pool of experts. We empirically demonstrate a key application of our algorithms to online clustering problems, with a 15-40% relative gain over static-regret-based algorithms on popular benchmarks.
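A minimal sketch of the style of algorithm this analysis concerns: a fixed-share exponential-weights forecaster run over a discretized one-dimensional parameter grid, where the uniform mixing step is what hedges against environment shifts. The grid, step size, and mixing rate below are illustrative assumptions, not the paper's exact algorithm (which works over the continuous domain).

```python
import numpy as np

def fixed_share_forecaster(payoffs, eta=0.5, alpha=0.01, seed=0):
    """Discretized fixed-share exponential weights over a 1-D parameter grid.

    payoffs: list of length-K arrays; payoffs[t][i] is the full-information
             payoff of grid point i in round t, assumed to lie in [0, 1].
    eta:     learning rate (illustrative choice).
    alpha:   fixed-share mixing rate, chosen here for illustration.
    Returns the indices of the grid points played in each round.
    """
    rng = np.random.default_rng(seed)
    K = len(payoffs[0])
    w = np.ones(K) / K
    played = []
    for u in payoffs:
        i = rng.choice(K, p=w)                 # sample a point from the current distribution
        played.append(i)
        w = w * np.exp(eta * np.asarray(u))    # exponential-weights update on observed payoffs
        w = w / w.sum()
        w = (1 - alpha) * w + alpha / K        # fixed-share step: no point's weight ever vanishes
    return played
```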
Semi-bandit Optimization in the Dispersed Setting
Balcan, Maria-Florina, Dick, Travis, Pegden, Wesley
In this work, we study the problem of online optimization of piecewise Lipschitz functions with semi-bandit feedback. This challenging class of non-convex optimization problems often arises in algorithm selection problems for combinatorial settings, where the goal is to find the best algorithm from a large algorithm family for a specific application domain. In these settings, each evaluation of the loss functions in the optimization problem can be computationally expensive, often requiring the learner to run a combinatorial algorithm to measure its performance. This, combined with the fact that small differences between similar algorithms in the family can lead to cascading changes in algorithm behavior, makes efficient online optimization in these settings a challenging problem. However, we show that in many applications, evaluating the loss function for one algorithm choice can sometimes reveal the loss for a range of similar algorithms, essentially for free. We develop online optimization algorithms capable of using this kind of extra information by working in the semi-bandit feedback setting. Our algorithms achieve regret bounds that are essentially as good as those of algorithms with full-information feedback, while being significantly more computationally efficient. We apply our semi-bandit optimization results to obtain online algorithm selection procedures for two rich families of combinatorial algorithms. We provide the first provable guarantees for online algorithm selection for clustering problems, using a family of clustering algorithms containing classic linkage procedures. We also show how to select algorithms from a family of greedy knapsack algorithms with simultaneously lower computational complexity and stronger regret bounds than the best algorithm selection procedures from prior work.
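As a rough illustration of semi-bandit feedback over a one-dimensional parameter, the sketch below runs exponential weights on a discretized grid; each round, evaluating the chosen parameter reveals the loss on a whole interval where the loss is constant, and every grid point in that interval gets an importance-weighted loss estimate. The grid, the reveal interface, and the step size are assumptions for illustration, not the algorithms analyzed in the paper.

```python
import numpy as np

def semi_bandit_exp_weights(rounds, grid, eta=0.5, seed=0):
    """Semi-bandit exponential weights over a discretized parameter grid.

    rounds: list of callables; rounds[t](x) returns (loss, lo, hi), meaning the
            round-t loss equals `loss` (in [0, 1]) on the whole interval [lo, hi]
            containing x.  This interface is an illustrative assumption.
    """
    grid = np.asarray(grid, dtype=float)
    K = len(grid)
    w = np.ones(K) / K
    rng = np.random.default_rng(seed)
    total_loss = 0.0
    for reveal in rounds:
        i = rng.choice(K, p=w)
        loss, lo, hi = reveal(grid[i])              # one evaluation reveals a whole interval
        total_loss += loss
        mask = (grid >= lo) & (grid <= hi)
        p_interval = w[mask].sum()                  # probability mass of the revealed interval
        est = np.zeros(K)
        est[mask] = loss / max(p_interval, 1e-12)   # importance-weighted loss estimate
        w = w * np.exp(-eta * est)                  # exponential-weights update on estimates
        w = w / w.sum()
    return total_loss
```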
Data-Driven Clustering via Parameterized Lloyd's Families
Balcan, Maria-Florina F., Dick, Travis, White, Colin
Clustering points in metric spaces is a long-studied area of research. Clustering has seen a multitude of work both theoretically, in understanding the approximation guarantees possible for many objective functions such as k-median and k-means, and experimentally, in finding the fastest algorithms and seeding procedures for Lloyd's algorithm. The performance of a given clustering algorithm depends on the specific application at hand, and this may not be known up front. For example, a "typical instance" may vary depending on the application, and different clustering heuristics perform differently depending on the instance. In this paper, we define an infinite family of algorithms generalizing Lloyd's algorithm, with one parameter controlling the initialization procedure and another parameter controlling the local search procedure. This family of algorithms includes the celebrated k-means++ algorithm, as well as the classic farthest-first traversal algorithm. We design efficient learning algorithms which receive samples from an application-specific distribution over clustering instances and learn a near-optimal clustering algorithm from the class. We show the best parameters vary significantly across datasets such as MNIST, CIFAR, and mixtures of Gaussians. Our learned algorithms never perform worse than k-means++, and on some datasets we see significant improvements.
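As a concrete illustration of the seeding half of such a parameterized family, the sketch below draws each new center with probability proportional to its distance from the chosen centers raised to a power alpha: alpha = 0 gives uniform random seeding, alpha = 2 matches k-means++ style $D^2$ sampling, and large alpha approaches farthest-first traversal. The function name and interface are illustrative, not the paper's exact parameterization.

```python
import numpy as np

def d_alpha_seeding(X, k, alpha, seed=0):
    """Choose k initial centers; each new center is drawn with probability
    proportional to dist(point, nearest chosen center) ** alpha."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    centers = [rng.integers(n)]
    dists = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(k - 1):
        weights = dists ** alpha
        if not np.isfinite(weights).all() or weights.sum() == 0:
            # overflow for very large alpha, or all points on a center:
            # fall back to picking a farthest point
            weights = (dists == dists.max()).astype(float)
        nxt = rng.choice(n, p=weights / weights.sum())
        centers.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return X[centers]
```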
Envy-Free Classification
Balcan, Maria-Florina, Dick, Travis, Noothigattu, Ritesh, Procaccia, Ariel D.
In classic fair division problems such as cake cutting and rent division, envy-freeness requires that each individual (weakly) prefer his allocation to anyone else's. On a conceptual level, we argue that envy-freeness also provides a compelling notion of fairness for classification tasks. Our technical focus is the generalizability of envy-free classification, i.e., understanding whether a classifier that is envy free on a sample would be almost envy free with respect to the underlying distribution with high probability. Our main result establishes that a small sample is sufficient to achieve such guarantees, when the classifier in question is a mixture of deterministic classifiers that belong to a family of low Natarajan dimension.
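To make this notion concrete for classification: given a randomized classifier that assigns each individual a distribution over outcomes, together with each individual's utility for each outcome, the maximum pairwise envy on a sample can be computed directly. The sketch below does exactly that; the matrices P and U are assumed inputs, and "envy-free on the sample" corresponds to the returned value being at most zero (or at most a small slack for approximate envy-freeness).

```python
import numpy as np

def max_empirical_envy(P, U):
    """Maximum pairwise envy of a randomized classifier on a sample.

    P[i, o]: probability the classifier assigns outcome o to individual i.
    U[i, o]: utility individual i derives from outcome o.
    The envy of i toward j is U[i] . P[j] - U[i] . P[i]; the classifier is
    empirically envy-free when the maximum over all ordered pairs is <= 0.
    """
    P, U = np.asarray(P, dtype=float), np.asarray(U, dtype=float)
    own = np.sum(P * U, axis=1)          # each individual's utility for their own assignment
    cross = U @ P.T                      # cross[i, j] = i's utility for j's assignment
    envy = cross - own[:, None]
    np.fill_diagonal(envy, -np.inf)      # ignore self-comparisons
    return envy.max()
```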
Learning to Branch
Balcan, Maria-Florina, Dick, Travis, Sandholm, Tuomas, Vitercik, Ellen
Tree search algorithms, such as branch-and-bound, are the most widely used tools for solving combinatorial and nonconvex problems. For example, they are the foremost method for solving (mixed) integer programs and constraint satisfaction problems. Tree search algorithms recursively partition the search space to find an optimal solution. In order to keep the tree size small, it is crucial to carefully decide, when expanding a tree node, which question (typically variable) to branch on at that node in order to partition the remaining space. Numerous partitioning techniques (e.g., variable selection) have been proposed, but there is no theory describing which technique is optimal. We show how to use machine learning to determine an optimal weighting of any set of partitioning procedures for the instance distribution at hand using samples from the distribution. We provide the first sample complexity guarantees for tree search algorithm configuration. These guarantees bound the number of samples sufficient to ensure that the empirical performance of an algorithm over the samples nearly matches its expected performance on the unknown instance distribution. This thorough theoretical investigation naturally gives rise to our learning algorithm. Via experiments, we show that learning an optimal weighting of partitioning procedures can dramatically reduce tree size, and we prove that this reduction can even be exponential. Through theory and experiments, we show that learning to branch is both practical and hugely beneficial.
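To make the "optimal weighting of partitioning procedures" concrete, the sketch below shows how a fixed weight vector over several variable-scoring rules would be applied at a single tree-search node: each rule scores every candidate variable, and the node branches on the variable maximizing the weighted combination. The score matrix and weights are assumed inputs; learning the weights from sampled instances is the subject of the paper.

```python
import numpy as np

def select_branching_variable(scores, mu):
    """Pick the variable to branch on using a weighted combination of scoring rules.

    scores[r][v]: score that rule r assigns to candidate variable v at this node
                  (e.g., pseudocost or most-fractional scores in a MIP solver).
    mu[r]:        weight on rule r, assumed already learned from sample instances.
    """
    scores = np.asarray(scores, dtype=float)
    mu = np.asarray(mu, dtype=float)
    combined = mu @ scores               # combined[v] = sum_r mu[r] * scores[r][v]
    return int(np.argmax(combined))
```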
Label Efficient Learning by Exploiting Multi-Class Output Codes
Balcan, Maria-Florina, Dick, Travis, Mansour, Yishay
We present a new perspective on the popular multi-class algorithmic techniques of one-vs-all and error correcting output codes. Rather than studying the behavior of these techniques for supervised learning, we establish a connection between the success of these methods and the existence of label-efficient learning procedures. We show that in both the realizable and agnostic cases, if output codes are successful at learning from labeled data, they implicitly assume structure on how the classes are related. By making that structure explicit, we design learning algorithms to recover the classes with low label complexity. We provide results for the commonly studied cases of one-vs-all learning and when the codewords of the classes are well separated. We additionally consider the more challenging case where the codewords are not well separated, but satisfy a boundary features condition that captures the natural intuition that every bit of the codewords should be significant.
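A toy illustration of how structural assumptions can translate into low label complexity (not the paper's procedure): if the classes form well-separated groups, clustering the unlabeled data and querying one label per cluster already labels most points correctly. The query_label callback and the use of k-means here are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_label(X_unlabeled, num_classes, query_label, seed=0):
    """Cluster unlabeled data into num_classes groups, query one label per
    cluster, and label every point by its cluster's queried label.

    query_label(index) is assumed to return the true label of the given point.
    """
    X = np.asarray(X_unlabeled, dtype=float)
    km = KMeans(n_clusters=num_classes, n_init=10, random_state=seed).fit(X)
    labels = np.empty(len(X), dtype=int)
    for c in range(num_classes):
        members = np.flatnonzero(km.labels_ == c)
        # query the point closest to the cluster center as the representative
        rep = members[np.argmin(np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1))]
        labels[members] = query_label(rep)
    return labels
```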
Data Driven Resource Allocation for Distributed Learning
Dick, Travis, Li, Mu, Pillutla, Venkata Krishna, White, Colin, Balcan, Maria-Florina, Smola, Alex
In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, that classification rules of high accuracy tend to be "locally simple but globally complex" (Vapnik and Bottou 1993), we propose data-dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving that existing scalable heuristics perform well in natural non-worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality-sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real-world image and advertising datasets. We also demonstrate that our technique strongly scales with the available computing power.
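As a simplified sketch of data-dependent dispatching (ignoring the balancedness and fault-tolerance constraints the paper handles), one can cluster a small sample into one group per machine and extend the rule to new points by nearest-center assignment. The use of k-means and the dispatch interface below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_dispatch_rule(sample, num_machines, seed=0):
    """Learn a dispatch rule from a small sample by clustering it into one
    group per machine; return a function routing any new point to the machine
    whose cluster center is nearest."""
    sample = np.asarray(sample, dtype=float)
    km = KMeans(n_clusters=num_machines, n_init=10, random_state=seed).fit(sample)
    centers = km.cluster_centers_

    def dispatch(x):
        # route the point to the machine with the nearest learned center
        return int(np.argmin(np.linalg.norm(centers - np.asarray(x, dtype=float), axis=1)))

    return dispatch
```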