
Collaborating Authors

 Narasimhan, Harikrishna


Churn Reduction via Distillation

arXiv.org Machine Learning

In real-world systems, models are frequently updated as more data becomes available, and in addition to achieving high accuracy, the goal is also to maintain a low difference in predictions compared to the base model (i.e., predictive "churn"). If model retraining results in vastly different behavior, then it could cause negative effects in downstream systems, especially if this churn can be avoided with limited impact on model accuracy. In this paper, we show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn. We then show that distillation performs strongly for low-churn training against a number of recent baselines on a wide range of datasets and model architectures, including fully-connected networks, convolutional networks, and transformers.
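The equivalence described above can be sketched as a blended training loss, where the distillation term is a cross-entropy against the base model's predicted distribution. This is a minimal illustration, not the paper's exact formulation; the names `churn_distillation_loss` and `lam` are hypothetical, with `lam` playing the role of the multiplier on the churn constraint.

```python
import math

def churn_distillation_loss(student_probs, true_label, teacher_probs, lam=0.5):
    # Standard cross-entropy against the ground-truth label.
    ce_label = -math.log(student_probs[true_label])
    # Cross-entropy against the base (teacher) model's predicted distribution;
    # this is the distillation term that discourages prediction churn.
    ce_teacher = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
    # lam trades off accuracy on the labels against agreement with the base model.
    return (1 - lam) * ce_label + lam * ce_teacher
```

With `lam=0` this reduces to ordinary training; with `lam=1` the student is trained purely to match the base model.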


Optimizing Black-box Metrics with Iterative Example Weighting

arXiv.org Machine Learning

We consider learning to optimize a classification metric defined by a black-box function of the confusion matrix. Such black-box learning settings are ubiquitous, for example, when the learner only has query access to the metric of interest, or in noisy-label and domain adaptation applications where the learner must estimate the metric from a small validation sample. Our approach is to adaptively learn example weights on the training dataset such that the resulting weighted objective best approximates the metric on the validation sample. We show how to model and estimate the example weights and use them to iteratively post-shift a pre-trained class probability estimator to construct a classifier. We also analyze the resulting procedure's statistical properties. Experiments on various label noise, domain shift, and fair classification setups confirm that our proposal is better than the individual state-of-the-art baselines for each application.
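As a minimal sketch of the post-shift step alone (not the iterative weight estimation), the optimal classifier for a weighted error thresholds a pre-trained class-probability estimate at the point where the weighted costs balance. All names here are hypothetical.

```python
def post_shift(eta, w_pos, w_neg):
    """Turn a class-probability estimator eta(x) into a classifier for the
    weighted objective: predict 1 iff w_pos * eta(x) > w_neg * (1 - eta(x)),
    i.e. threshold eta at w_neg / (w_pos + w_neg)."""
    threshold = w_neg / (w_pos + w_neg)
    return lambda x: int(eta(x) > threshold)

# Toy usage: eta is the identity on [0, 1]; up-weighting positives 2:1
# lowers the decision threshold from 0.5 to 1/3.
classify = post_shift(lambda x: x, w_pos=2.0, w_neg=1.0)
```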


Distilling Double Descent

arXiv.org Artificial Intelligence

Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset. The most common explanations for why distillation "works" are predicated on the assumption that the student is provided with soft labels, e.g. probabilities or confidences, from the teacher model. In this work, we show that, even when the teacher model is highly overparameterized and provides hard labels, using a very large held-out unlabeled dataset to train the student model can result in a model that outperforms more "traditional" approaches. Our explanation for this phenomenon is based on recent work on "double descent". It has been observed that, once a model's complexity roughly exceeds the amount required to memorize the training data, increasing the complexity further can, counterintuitively, result in better generalization. Researchers have identified several settings in which double descent takes place, while others have made various attempts to explain it (thus far, with only partial success). In contrast, we avoid these questions, and instead seek to exploit this phenomenon by demonstrating that a highly overparameterized teacher can avoid overfitting via double descent, while a student trained on a larger independent dataset labeled by this teacher will avoid overfitting due to the size of its training set.
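A minimal sketch of this pipeline, using toy 1-D threshold models (every name here is illustrative, not from the paper): the teacher hard-labels a large unlabeled pool, and the student is fit on those pseudo-labels alone.

```python
def hard_label_distill(teacher_predict, unlabeled_pool, student_fit):
    # Pseudo-label the pool with the teacher's hard (0/1) predictions,
    # then train the student on the pseudo-labeled data only.
    pseudo = [(x, teacher_predict(x)) for x in unlabeled_pool]
    return student_fit(pseudo)

def fit_threshold(data):
    # Toy "student": choose the threshold with best pseudo-label accuracy.
    best_t, best_acc = 0.0, -1.0
    for t in sorted(x for x, _ in data):
        acc = sum((x > t) == bool(y) for x, y in data) / len(data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

teacher = lambda x: int(x > 0.5)       # stand-in for a trained teacher model
pool = [i / 100 for i in range(100)]   # large held-out unlabeled dataset
student_threshold = hard_label_distill(teacher, pool, fit_threshold)
```

With enough unlabeled data, the student recovers the teacher's decision boundary despite never seeing a soft label or a ground-truth one.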


Quadratic Metric Elicitation with Application to Fairness

arXiv.org Machine Learning

Given a classification problem, which performance metric should the classifier optimize? This question is often faced by practitioners while developing machine learning solutions. For example, consider cancer diagnosis where the doctor applies a cost-sensitive predictive model to classify patients into cancer categories [53, 56]. Although it is clear that the chosen costs directly determine the model decisions and thus patient outcomes, it is not clear how to quantify expert intuition into precise quantitative cost tradeoffs, i.e. the performance metric. Indeed this is also true for a variety of other domains where picking the right metric is a critical challenge [8]. Hiranandani et al. [16, 17] addressed this issue by formalizing the problem of Metric Elicitation (ME), where the goal is to estimate a performance metric using preference feedback from a user. The motivation is that by employing metrics that reflect a user's innate tradeoffs, one can learn models that best capture the user preferences [16].
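To make the preference-feedback model concrete, here is a hedged toy sketch (not the paper's algorithm, which elicits quadratic metrics): if the user's latent preference over a single metric parameter is unimodal, pairwise comparisons alone suffice to localize the preferred value by ternary search. All names are illustrative.

```python
def elicit_metric_param(prefer, lo=0.0, hi=1.0, tol=1e-4):
    # prefer(a, b) -> True if the user prefers parameter value a over b.
    # Assumes the latent utility is strictly unimodal on [lo, hi].
    while hi - lo > tol:
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if prefer(a, b):
            hi = b   # the peak cannot lie in (b, hi]
        else:
            lo = a   # the peak cannot lie in [lo, a)
    return (lo + hi) / 2

# Simulated oracle whose latent metric peaks at 0.3.
recovered = elicit_metric_param(lambda a, b: abs(a - 0.3) < abs(b - 0.3))
```

Each query halves nothing but trims a third of the interval, so roughly log(1/tol) comparisons recover the latent tradeoff.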


Fair Performance Metric Elicitation

arXiv.org Machine Learning

Machine learning models are increasingly employed for critical decision-making tasks such as hiring and sentencing [44, 3, 11, 14, 31]. Yet, it is increasingly evident that automated decision-making is susceptible to bias, whereby decisions made by the algorithm are unfair to certain subgroups [5, 3, 10, 8, 31]. To this end, a wide variety of group fairness metrics have been proposed, all aiming to reduce discrimination and bias in automated decision-making [25, 13, 17, 29, 49, 32]. However, a dearth of formal principles for selecting the most appropriate metric has highlighted the confusion of experts, practitioners, and end users in deciding which group fairness metric to employ [53]. This is further exacerbated by the observation that common metrics often lead to contradictory outcomes [29]. While the problem of selecting an appropriate fairness metric has gained prominence in recent years [17, 32, 53], it is perhaps best understood as a special case of the task of choosing evaluation metrics in machine learning.


Optimizing Generalized Rate Metrics through Game Equilibrium

arXiv.org Machine Learning

We present a general framework for solving a large class of learning problems with non-linear functions of classification rates. This includes problems where one wishes to optimize a non-decomposable performance metric such as the F-measure or G-mean, and constrained training problems where the classifier needs to satisfy non-linear rate constraints such as predictive parity fairness, distribution divergences or churn ratios. We extend previous two-player game approaches for constrained optimization to a game between three players to decouple the classifier rates from the non-linear objective, and seek to find an equilibrium of the game. Our approach generalizes many existing algorithms, and makes possible new algorithms with more flexibility and tighter handling of non-linear rate constraints. We provide convergence guarantees for convex functions of rates, and show how our methodology can be extended to handle sums of ratios of rates. Experiments on different fairness tasks confirm the efficacy of our approach.
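As a hedged illustration of the game viewpoint (the two-player Lagrangian starting point that the paper extends to three players, on a toy scalar problem rather than classification rates): the constrained problem min x² subject to x ≥ 1 can be solved by gradient descent on x played against projected gradient ascent on the multiplier.

```python
def lagrangian_game(lr=0.05, steps=4000):
    # Lagrangian L(x, lam) = x**2 + lam * (1 - x); the x-player descends,
    # the lam-player ascends (projected to lam >= 0).
    # Equilibrium of this game: x = 1, lam = 2.
    x, lam = 0.0, 0.0
    for _ in range(steps):
        x -= lr * (2 * x - lam)             # descent step on L in x
        lam = max(0.0, lam + lr * (1 - x))  # ascent step on L in lam
    return x, lam

x_star, lam_star = lagrangian_game()
```

The strong convexity in x damps the oscillation that pure bilinear games exhibit, so simultaneous gradient play converges to the constrained optimum here.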


Pairwise Fairness for Ranking and Regression

arXiv.org Machine Learning

We present pairwise metrics of fairness for ranking and regression models that form analogues of statistical fairness notions such as equal opportunity or equal accuracy, as well as statistical parity. Our pairwise formulation supports both discrete protected groups, and continuous protected attributes. We show that the resulting training problems can be efficiently and effectively solved using constrained optimization and robust optimization techniques based on two-player game algorithms developed for fair classification. Experiments illustrate the broad applicability and trade-offs of these methods.
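A minimal sketch of one such pairwise metric (illustrative names; the paper's exact definitions may condition differently): for each ordered group pair, compute the fraction of correctly ordered example pairs whose higher-labeled member comes from the first group. Large gaps between group pairs would flag pairwise unfairness.

```python
from collections import defaultdict

def group_pairwise_accuracy(scores, labels, groups):
    # For each ordered group pair (g_i, g_j), accuracy over pairs where
    # example i (from g_i) has a higher label than example j (from g_j).
    wins, totals = defaultdict(int), defaultdict(int)
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:
                key = (groups[i], groups[j])
                totals[key] += 1
                wins[key] += int(scores[i] > scores[j])
    return {k: wins[k] / totals[k] for k in totals}

acc = group_pairwise_accuracy(
    scores=[0.9, 0.2, 0.8, 0.1],
    labels=[1, 0, 1, 0],
    groups=["a", "a", "b", "b"],
)
```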


Optimal Auctions through Deep Learning

arXiv.org Artificial Intelligence

Optimal auction design is one of the cornerstones of economic theory. It is of great practical importance, as auctions are used across industries and by the public sector to organize the sale of their products and services. Concrete examples are the US FCC Incentive Auction, the sponsored search auctions conducted by web search engines such as Google, or the auctions run on platforms such as eBay. In the standard independent private valuations model, each bidder has a valuation function over subsets of items, drawn independently from not necessarily identical distributions. It is assumed that the auctioneer knows the distributions and can (and will) use this information in designing the auction. A major difficulty in designing auctions is that valuations are private and bidders need to be incentivized to report their valuations truthfully. The goal is to learn an incentive compatible auction that maximizes revenue. The first version of this paper was posted on arXiv on June 12, 2017.


Online Optimization Methods for the Quantification Problem

arXiv.org Machine Learning

The estimation of class prevalence, i.e., the fraction of a population that belongs to a certain class, is a very useful tool in data analytics and learning, and finds applications in many domains such as sentiment analysis, epidemiology, etc. For example, in sentiment analysis, the objective is often not to estimate whether a specific text conveys a positive or a negative sentiment, but rather estimate the overall distribution of positive and negative sentiments during an event window. A popular way of performing the above task, often dubbed quantification, is to use supervised learning to train a prevalence estimator from labeled data. Contemporary literature cites several performance measures used to measure the success of such prevalence estimators. In this paper we propose the first online stochastic algorithms for directly optimizing these quantification-specific performance measures. We also provide algorithms that optimize hybrid performance measures that seek to balance quantification and classification performance. Our algorithms present a significant advancement in the theory of multivariate optimization and we show, by a rigorous theoretical analysis, that they exhibit optimal convergence. We also report extensive experiments on benchmark and real data sets which demonstrate that our methods significantly outperform existing optimization techniques used for these performance measures.
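For background, the classical prevalence estimators whose error measures the paper's online algorithms optimize can be sketched as follows. These are the standard classify-and-count and adjusted-count estimators, not the paper's algorithms; the names are illustrative.

```python
def classify_and_count(preds):
    # Naive prevalence estimate: fraction of examples predicted positive.
    return sum(preds) / len(preds)

def adjusted_count(preds, tpr, fpr):
    # Correct the naive count with the classifier's true/false positive
    # rates (estimated on held-out data): p = (cc - fpr) / (tpr - fpr).
    cc = classify_and_count(preds)
    p = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))  # clip to a valid proportion
```

For a perfectly calibrated setting (tpr and fpr exact), the adjusted count is unbiased; quantification performance measures score how far such estimates fall from the true prevalence.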


Support Vector Algorithms for Optimizing the Partial Area Under the ROC Curve

arXiv.org Machine Learning

The area under the ROC curve (AUC) is a widely used performance measure in machine learning. Increasingly, however, in several applications, ranging from ranking to biometric screening to medicine, performance is measured not in terms of the full area under the ROC curve, but in terms of the partial area under the ROC curve between two false positive rates. In this paper, we develop support vector algorithms for directly optimizing the partial AUC between any two false positive rates. Our methods are based on minimizing a suitable proxy or surrogate objective for the partial AUC error. In the case of the full AUC, one can readily construct and optimize convex surrogates by expressing the performance measure as a summation of pairwise terms. The partial AUC, on the other hand, does not admit such a simple decomposable structure, making it more challenging to design and optimize (tight) convex surrogates for this measure. Our approach builds on the structural SVM framework of Joachims (2005) to design convex surrogates for partial AUC, and solves the resulting optimization problem using a cutting plane solver. Unlike the full AUC, where the combinatorial optimization needed in each iteration of the cutting plane solver can be decomposed and solved efficiently, the corresponding problem for the partial AUC is harder to decompose. One of our main contributions is a polynomial time algorithm for solving the combinatorial optimization problem associated with partial AUC. We also develop an approach for optimizing a tighter non-convex hinge loss based surrogate for the partial AUC using difference-of-convex programming. Our experiments on a variety of real-world and benchmark tasks confirm the efficacy of the proposed methods.
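The quantity itself is straightforward to evaluate (this is the evaluation measure, not the paper's SVM optimizer): integrate the step-function ROC curve over the chosen FPR interval. The sketch below assumes distinct scores and counts ties as misranked.

```python
def partial_auc(pos_scores, neg_scores, fpr_lo, fpr_hi):
    # Step-function ROC: the j-th highest negative moves the FPR from
    # j/n_neg to (j+1)/n_neg; on that strip the TPR is the fraction of
    # positives scored strictly above it. Integrate the TPR over
    # [fpr_lo, fpr_hi] and normalize by the interval width.
    neg = sorted(neg_scores, reverse=True)
    n_neg, n_pos = len(neg), len(pos_scores)
    area = 0.0
    for j, s in enumerate(neg):
        lo = max(fpr_lo, j / n_neg)
        hi = min(fpr_hi, (j + 1) / n_neg)
        if hi > lo:
            tpr = sum(p > s for p in pos_scores) / n_pos
            area += (hi - lo) * tpr
    return area / (fpr_hi - fpr_lo)
```

Setting the interval to [0, 1] recovers the usual Mann-Whitney form of the full AUC; narrowing it to a low-FPR band is what the screening and biometric applications above require.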