Performance Analysis
Sparse Approximate Cross-Validation for High-Dimensional GLMs
Stephenson, William, Broderick, Tamara
Leave-one-out cross validation (LOOCV) can be particularly accurate among CV variants for estimating out-of-sample error. Unfortunately, LOOCV requires re-fitting a model $N$ times for a dataset of size $N$. To avoid this prohibitive computational expense, a number of authors have proposed approximations to LOOCV. These approximations work well when the unknown parameter is of small, fixed dimension but suffer in high dimensions; they incur a running time roughly cubic in the dimension, and, in fact, we show their accuracy significantly deteriorates in high dimensions. We demonstrate that these difficulties can be surmounted in $\ell_1$-regularized generalized linear models when we assume that the unknown parameter, while high dimensional, has a small support. In particular, we show that, under interpretable conditions, the support of the recovered parameter does not change as each datapoint is left out. This result implies that the previously proposed heuristic of only approximating CV along the support of the recovered parameter has running time and error that scale with the (small) support size even when the full dimension is large. Experiments on synthetic and real data support the accuracy of our approximations.
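To make the cost of refitting concrete, here is a minimal sketch that compares brute-force LOOCV (N refits) with a single-fit shortcut. Note the substitution: this uses the classical exact leverage identity for ordinary least squares, not the approximation for $\ell_1$-regularized GLMs studied in the paper. The function names and toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def loocv_refit(xs, ys):
    """Exact leave-one-out residuals by brute force: one refit per datapoint."""
    residuals = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        residuals.append(ys[i] - (a + b * xs[i]))
    return residuals


def loocv_shortcut(xs, ys):
    """Same residuals from a single fit, using e_i / (1 - h_ii)."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    residuals = []
    for x, y in zip(xs, ys):
        # Leverage of point i for a line with intercept.
        leverage = 1.0 / n + (x - x_bar) ** 2 / sxx
        residuals.append((y - (a + b * x)) / (1.0 - leverage))
    return residuals


# Toy data (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1, 9.7]
brute = loocv_refit(xs, ys)
fast = loocv_shortcut(xs, ys)
```

For least squares the two lists agree up to floating-point error. For GLMs no such exact identity is available in general, which is why approximations of the kind discussed above are needed.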
Optimized Score Transformation for Fair Classification
Wei, Dennis, Ramamurthy, Karthikeyan Natesan, Calmon, Flavio du Pin
Recent years have seen a surge of interest in the problem of fair classification, which is concerned with disparities in classification output or performance when conditioned on a protected attribute such as race, gender, or ethnicity. Many measures of fairness have been introduced [1-14] and fairness-enhancing interventions have been proposed to mitigate these disparities [15]. Roughly categorized, these interventions either (i) change data used to train a classifier (pre-processing) [16-20], (ii) change a classifier's output (post-processing) [4, 21-24], or (iii) directly change a classification model to ensure fairness (in-processing) [5, 25-32]. This paper places more emphasis on probabilistic classification in which the outputs of interest are predicted probabilities of belonging to one of the classes, often referred to as scores, as opposed to binary predictions. Scores are desirable because they indicate confidences in predictions. We propose an optimization formulation for transforming scores to satisfy fairness constraints while minimizing the loss in utility. The formulation accommodates any fairness criteria that can be expressed as linear inequalities involving conditional means of scores, including variants of statistical parity (SP) [1] and equalized odds (EO) [4, 5]. We derive a closed-form expression for the optimal transformed scores and a convex dual optimization problem for the Lagrange multipliers that parametrize the transformation.
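As an illustration of what a criterion "involving conditional means of scores" can look like, the sketch below computes the mean score per protected group and the largest gap between groups. Bounding that gap by some tolerance is one simple parity-style constraint, since it amounts to linear inequalities in the group means. This is an assumed, simplified variant for illustration, not the paper's formulation; the function names and data are hypothetical.

```python
def group_mean_scores(scores, groups):
    """Mean predicted score conditioned on each protected-group value."""
    totals, counts = {}, {}
    for s, g in zip(scores, groups):
        totals[g] = totals.get(g, 0.0) + s
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


def mean_score_parity_gap(scores, groups):
    """Largest difference between any two group-conditional mean scores."""
    means = group_mean_scores(scores, groups)
    return max(means.values()) - min(means.values())


# Toy scores and group labels (made up for illustration).
scores = [0.9, 0.7, 0.8, 0.4, 0.3, 0.5]
groups = ["a", "a", "a", "b", "b", "b"]
means = group_mean_scores(scores, groups)
gap = mean_score_parity_gap(scores, groups)
```

A score transformation satisfying such a constraint would shrink `gap` below the chosen tolerance while changing the scores as little as possible.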
High Dimensional Classification via Empirical Risk Minimization: Improvements and Optimality
In this article, we investigate a family of classification algorithms defined by the principle of empirical risk minimization, in the high dimensional regime where the feature dimension $p$ and the number of samples $n$ are both large and comparable. Based on recent advances in high dimensional statistics and random matrix theory, we provide, under a mixture data model, a unified stochastic characterization of classifiers learned with different loss functions. Our results are instrumental to an in-depth understanding of, as well as practical improvements to, this fundamental classification approach. As the main outcome, we demonstrate the existence of a universally optimal loss function which yields the best high dimensional performance at any given $n/p$ ratio.
Reverse KL-Divergence Training of Prior Networks: Improved Uncertainty and Adversarial Robustness
Ensemble approaches for uncertainty estimation have recently been applied to the tasks of misclassification detection, out-of-distribution input detection and adversarial attack detection. Prior Networks have been proposed as an approach to efficiently emulating an ensemble of models by parameterising a Dirichlet prior distribution over output distributions. These models have been shown to outperform ensemble approaches, such as Monte-Carlo Dropout, on the task of out-of-distribution input detection. However, scaling Prior Networks to complex datasets with many classes is difficult using the training criteria originally proposed. This paper makes two contributions. Firstly, we show that the appropriate training criterion for Prior Networks is the reverse KL-divergence between Dirichlet distributions. Using this loss we successfully train Prior Networks on image classification datasets with up to 200 classes and improve out-of-distribution detection performance. Secondly, taking advantage of the new training criterion, this paper investigates using Prior Networks to detect adversarial attacks. It is shown that the construction of successful adaptive whitebox attacks, which affect the prediction and evade detection, against Prior Networks trained on CIFAR-10 and CIFAR-100 takes a greater amount of computational effort than against standard neural networks, adversarially trained neural networks and dropout-defended networks.
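The KL divergence between two Dirichlet distributions has a standard closed form, which the sketch below evaluates using only the Python standard library. It takes the "reverse" direction to mean the model's Dirichlet as the first argument and the target Dirichlet as the second; that reading, the helper names, and the concentration values are assumptions for illustration, not details taken from the paper.

```python
import math


def digamma(x):
    """Digamma function for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    # Shift x upward until the asymptotic expansion is accurate.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    result += math.log(x) - 0.5 * inv
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def dirichlet_kl(alpha, beta):
    """KL( Dir(alpha) || Dir(beta) ) for positive concentration vectors."""
    a0, b0 = sum(alpha), sum(beta)
    kl = math.lgamma(a0) - math.lgamma(b0)
    for a, b in zip(alpha, beta):
        kl += math.lgamma(b) - math.lgamma(a)
        kl += (a - b) * (digamma(a) - digamma(a0))
    return kl


# Hypothetical concentrations: a model output and a sharp target on class 3.
model = [2.0, 3.0, 5.0]
target = [1.0, 1.0, 10.0]
reverse_kl = dirichlet_kl(model, target)   # model first
forward_kl = dirichlet_kl(target, model)   # target first
```

The two directions generally give different values, which is why the choice of direction matters as a training criterion.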
NLP Guide: Identifying Part of Speech Tags using Conditional Random Fields
In the world of Natural Language Processing (NLP), the most basic models are based on Bag of Words. But such models fail to capture the syntactic relations between words. For example, suppose we build a sentiment analyser based on only Bag of Words. Such a model will not be able to capture the difference between "I like you", where "like" is a verb with a positive sentiment, and "I am like you", where "like" is a preposition with a neutral sentiment. So this leaves us with a question -- how do we improve on this Bag of Words technique?
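The limitation is easy to see in a minimal sketch. Here a bag of words is just a count of lowercase, whitespace-separated tokens (a simplifying assumption), and the helper name is hypothetical.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count lowercase whitespace-separated tokens, ignoring word order."""
    return Counter(sentence.lower().split())


verb_use = bag_of_words("I like you")
prep_use = bag_of_words("I am like you")

# The feature for "like" is the same count in both sentences.
like_counts = (verb_use["like"], prep_use["like"])
# The only difference between the two bags is the extra token "am".
extra_tokens = prep_use - verb_use
```

Both sentences contribute the same value for the "like" feature, so a model built on these counts has no direct signal about whether "like" is acting as a verb or a preposition.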
Weakly supervised training of pixel resolution segmentation models on whole slide images
We present a novel approach to train pixel resolution segmentation models on whole slide images in a weakly supervised setup. The model is trained to classify patches extracted from slides. As a result, training is carried out on noisily labeled data. We solve the problem with two complementary strategies. First, the patches are sampled online using the model's knowledge by focusing on regions where the model's confidence is higher. Second, we propose an extension of the KL divergence that is robust to noisy labels. Our preliminary experiment on the CAMELYON 16 data set shows promising results.
Semi-Unsupervised Lifelong Learning for Sentiment Classification: Less Manual Data Annotation and More Self-Studying
Hong, Xianbin, Pal, Gautam, Guan, Sheng-Uei, Wong, Prudence, Liu, Dawei, Man, Ka Lok, Huang, Xin
Lifelong machine learning is a novel machine learning paradigm that continually accumulates knowledge during learning. Its ability to extract and reuse knowledge enables lifelong machine learning to solve related problems. Traditional approaches such as Naïve Bayes and some neural-network-based approaches aim only to achieve the best performance on a single task. In contrast, the lifelong machine learning in this paper focuses on how to accumulate knowledge during learning and leverage it for further tasks. Meanwhile, knowledge reuse also significantly decreases the demand for labelled training data. This paper suggests that the aim of lifelong learning is to use less labelled data and lower computational cost to achieve performance as good as or even better than that of supervised learning.
Modeling Uncertainty by Learning a Hierarchy of Deep Neural Connections
Rohekar, Raanan Y., Gurwicz, Yaniv, Nisimov, Shami, Novik, Gal
Quantifying and measuring uncertainty in deep neural networks, despite recent important advances, is still an open problem. Bayesian neural networks are a powerful solution, where the prior over network weights is a design choice, often a normal distribution or other distribution encouraging sparsity. However, this prior is agnostic to the generative process of the input data, which might lead to unwarranted generalization for out-of-distribution test data. We suggest treating the generative process of the input data as a confounder for the relation between the input and the discriminative function, thereby conditioning the prior of the network weights on the distribution of the input. We propose an algorithm for modeling this confounder through neural connectivity patterns. This approach is ultimately translated into a new deep architecture: a compact hierarchy of networks. We demonstrate that sampling networks from this hierarchy, proportionally to their posterior, is efficient and enables estimating various types of uncertainties. Empirical evaluations of our method demonstrate significant improvement compared to state-of-the-art calibration and out-of-distribution detection methods.
The cost-free nature of optimally tuning Tikhonov regularizers and other ordered smoothers
We consider the problem of selecting the best estimator among a family of Tikhonov regularized estimators, or, alternatively, selecting a linear combination of these regularizers that is as good as the best regularizer in the family. Our theory reveals that if the Tikhonov regularizers share the same penalty matrix with different tuning parameters, a convex procedure based on $Q$-aggregation achieves the mean square error of the best estimator, up to a small error term no larger than $C\sigma^2$, where $\sigma^2$ is the noise level and $C>0$ is an absolute constant. Remarkably, the error term does not depend on the penalty matrix or the number of estimators as long as they share the same penalty matrix, i.e., it applies to any grid of tuning parameters, no matter how large the cardinality of the grid is. This reveals the surprising "cost-free" nature of optimally tuning Tikhonov regularizers, in striking contrast with the existing literature on aggregation of estimators where one typically has to pay a cost of $\sigma^2\log(M)$ where $M$ is the number of estimators in the family. The result holds, more generally, for any family of ordered linear smoothers. This encompasses Ridge regression as well as Principal Component Regression. The result is extended to the problem of tuning Tikhonov regularizers with different penalty matrices.
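For readers unfamiliar with the terminology, the following sketch writes out a Tikhonov regularized estimator and the sense in which a family sharing one penalty matrix is "ordered". The notation ($X$, $y$, $K$, $A_\lambda$) is assumed here for illustration and may differ from the paper's; it also assumes $K$ is positive semidefinite and $X^\top X + \lambda K$ is invertible.

```latex
\begin{align*}
\hat{\beta}_\lambda
  &= \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda\, \beta^\top K \beta
   = (X^\top X + \lambda K)^{-1} X^\top y, \\
\hat{\mu}_\lambda
  &= X \hat{\beta}_\lambda = A_\lambda y,
  \qquad A_\lambda = X (X^\top X + \lambda K)^{-1} X^\top, \\
0 < \lambda \le \lambda'
  &\;\Longrightarrow\; A_{\lambda'} \preceq A_\lambda .
\end{align*}
```

The last line holds because increasing $\lambda$ increases $X^\top X + \lambda K$ in the positive semidefinite order, and matrix inversion reverses that order. Ridge regression corresponds to $K = I$.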
Exploiting Epistemic Uncertainty of Anatomy Segmentation for Anomaly Detection in Retinal OCT
Seeböck, Philipp, Orlando, José Ignacio, Schlegl, Thomas, Waldstein, Sebastian M., Bogunović, Hrvoje, Klimscha, Sophie, Langs, Georg, Schmidt-Erfurth, Ursula
Diagnosis and treatment guidance are aided by detecting relevant biomarkers in medical images. Although supervised deep learning can perform accurate segmentation of pathological areas, it is limited by requiring a-priori definitions of these regions, large-scale annotations, and a representative patient cohort in the training set. In contrast, anomaly detection is not limited to specific definitions of pathologies and allows for training on healthy samples without annotation. Anomalous regions can then serve as candidates for biomarker discovery. Knowledge about normal anatomical structure brings implicit information for detecting anomalies. We propose to take advantage of this property using Bayesian deep learning, based on the assumption that epistemic uncertainties will correlate with anatomical deviations from a normal training set. A Bayesian U-Net is trained on a well-defined healthy environment using weak labels of healthy anatomy produced by existing methods. At test time, we capture epistemic uncertainty estimates of our model using Monte Carlo dropout. A novel post-processing technique is then applied to exploit these estimates and transfer their layered appearance to smooth blob-shaped segmentations of the anomalies. We experimentally validated this approach in retinal optical coherence tomography (OCT) images, using weak labels of retinal layers. Our method achieved a Dice index of 0.789 in an independent anomaly test set of age-related macular degeneration (AMD) cases. The resulting segmentations allowed very high accuracy for separating healthy and diseased cases with late wet AMD, dry geographic atrophy (GA), diabetic macular edema (DME) and retinal vein occlusion (RVO). Finally, we qualitatively observed that our approach can also detect other deviations in normal scans such as cut edge artifacts.
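For reference, the Dice index reported above measures the overlap between a predicted and a reference segmentation as twice the intersection divided by the sum of the two mask sizes. A minimal sketch on flattened binary masks follows; the function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration.

```python
def dice_index(pred, truth):
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if size_sum == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / size_sum


# Toy masks (made up): 3 predicted pixels, 3 reference pixels, 2 shared.
pred_mask = [0, 1, 1, 1, 0, 0]
true_mask = [0, 0, 1, 1, 1, 0]
score = dice_index(pred_mask, true_mask)
```

A value of 1 means the masks coincide exactly and 0 means they do not overlap at all.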