Regression
Lexicographically Fair Learning: Algorithms and Generalization
Diana, Emily, Gill, Wesley, Globus-Harris, Ira, Kearns, Michael, Roth, Aaron, Sharifi-Malvajerdi, Saeed
We extend the notion of minimax fairness in supervised learning problems to its natural conclusion: lexicographic minimax fairness (or lexifairness for short). Informally, given a collection of demographic groups of interest, minimax fairness asks that the error of the group with the highest error be minimized. Lexifairness goes further and asks that amongst all minimax fair solutions, the error of the group with the second highest error should be minimized, and amongst all of those solutions, the error of the group with the third highest error should be minimized, and so on. Despite its naturalness, correctly defining lexifairness is considerably more subtle than minimax fairness, because of inherent sensitivity to approximation error. We give a notion of approximate lexifairness that avoids this issue, and then derive oracle-efficient algorithms for finding approximately lexifair solutions in a very general setting. When the underlying empirical risk minimization problem absent fairness constraints is convex (as it is, for example, with linear and logistic regression), our algorithms are provably efficient even in the worst case. Finally, we show generalization bounds -- approximate lexifairness on the training sample implies approximate lexifairness on the true distribution with high probability. Our ability to prove generalization bounds depends on our choosing definitions that avoid the instability of naive definitions.
Efficient Designs of SLOPE Penalty Sequences in Finite Dimension
In linear regression, SLOPE is a new convex analysis method that generalizes the Lasso via the sorted L1 penalty: larger fitted coefficients are penalized more heavily. This magnitude-dependent regularization requires an input of penalty sequence $\lambda$, instead of a scalar penalty as in the Lasso case, thus making the design extremely expensive in computation. In this paper, we propose two efficient algorithms to design the possibly high-dimensional SLOPE penalty, in order to minimize the mean squared error. For Gaussian data matrices, we propose a first order Projected Gradient Descent (PGD) under the Approximate Message Passing regime. For general data matrices, we present a zero-th order Coordinate Descent (CD) to design a sub-class of SLOPE, referred to as the k-level SLOPE. Our CD allows a useful trade-off between the accuracy and the computation speed. We demonstrate the performance of SLOPE with our designs via extensive experiments on synthetic data and real-world datasets.
The MELODIC family for simultaneous binary logistic regression in a reduced space
de Rooij, Mark, Groenen, Patrick J. F.
Logistic regression is a commonly used method for binary classification. Researchers often have more than a single binary response variable and simultaneous analysis is beneficial because it provides insight into the dependencies among response variables as well as between the predictor variables and the responses. Moreover, in such a simultaneous analysis the equations can lend each other strength, which might increase predictive accuracy. In this paper, we propose the MELODIC family for simultaneous binary logistic regression modeling. In this family, the regression models are defined in a Euclidean space of reduced dimension, based on a distance rule. The model may be interpreted in terms of logistic regression coefficients or in terms of a biplot. We discuss a fast iterative majorization (or MM) algorithm for parameter estimation. Two applications are shown in detail: one relating personality characteristics to drug consumption profiles and one relating personality characteristics to depressive and anxiety disorders. We present a thorough comparison of our MELODIC family with alternative approaches for multivariate binary data.
Conditional Distributional Treatment Effect with Kernel Conditional Mean Embeddings and U-Statistic Regression
Park, Junhyung, Shalit, Uri, Schรถlkopf, Bernhard, Muandet, Krikamol
We propose to analyse the conditional distributional treatment effect (CoDiTE), which, in contrast to the more common conditional average treatment effect (CATE), is designed to encode a treatment's distributional aspects beyond the mean. We first introduce a formal definition of the CoDiTE associated with a distance function between probability measures. Then we discuss the CoDiTE associated with the maximum mean discrepancy via kernel conditional mean embeddings, which, coupled with a hypothesis test, tells us whether there is any conditional distributional effect of the treatment. Finally, we investigate what kind of conditional distributional effect the treatment has, both in an exploratory manner via the conditional witness function, and in a quantitative manner via U-statistic regression, generalising the CATE to higher-order moments. Experiments on synthetic, semi-synthetic and real datasets demonstrate the merits of our approach.
Covid-19 risk factors: Statistical learning from German healthcare claims data
Jucknewitz, Roland, Weidinger, Oliver, Schramm, Anja
We analyse prior risk factors for severe, critical or fatal courses of Covid-19 based on a retrospective cohort using claims data of the AOK Bayern. As our main methodological contribution, we avoid prior grouping and pre-selection of candidate risk factors. Instead, fine-grained hierarchical information from medical classification systems for diagnoses, pharmaceuticals and procedures are used, resulting in more than 33,000 covariates. Our approach has better predictive ability than well-specified morbidity groups but does not need prior subject-matter knowledge. The methodology and estimated coefficients are made available to decision makers to prioritize protective measures towards vulnerable subpopulations and to researchers who like to adjust for a large set of confounders in studies of individual risk factors.
Don't Just Blame Over-parametrization for Over-confidence: Theoretical Analysis of Calibration in Binary Classification
Bai, Yu, Mei, Song, Wang, Huan, Xiong, Caiming
Modern machine learning models such as deep neural networks with high accuracy tend to be miscalibrated: The predicted top probability (confidence) does not reflect the actual accuracy of the model, and tends to be over-confident. For example, a WideResNet 32 on CIFAR100 has on average a predicted top probability of 87%, while the actual test accuracy is only 72% (Guo et al., 2017). As the confidence is often comprehended as an estimate of the true accuracy, such over-confidence could be dangerous, especially in risk-sensitive domains such as medical AI (Begoli et al., 2019), self-driving cars (Michelmore et al., 2018), and so on. To address this issue, there is a growing line of research on improving the calibration of models, by either performing recalibration of well-trained models to adjust the confidence scores (Platt et al., 1999; Zadrozny and Elkan; Naeini et al., 2015; Guo et al., 2017), or by averaging the predictions over multiple models to make the confidence scores more accurate (Lakshminarayanan et al., 2016; Gal and Ghahramani, 2016). These methods in general can reduce the over-confidence and improve the calibration of the model, while preserving (or even improving) the model's accuracy (Ovadia et al., 2019). Despite these progresses, the more fundamental question of why such over-confidence happens for vanillaly trained models remains not satisfactorily understood.
Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy
Sen, Rajat, Rakhlin, Alexander, Ying, Lexing, Kidambi, Rahul, Foster, Dean, Hill, Daniel, Dhillon, Inderjit
Motivated by modern applications, such as online advertisement and recommender systems, we study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous, and the learner is allowed to select $k$ arms and observe all or some of the rewards for the chosen arms. We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy for selecting multiple arms. We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log (|\mathcal{F}|T)})$, where $A$ is the total number of arms and $\mathcal{F}$ is the class containing the regression function, while only requiring $\tilde{O}(A)$ computation per time step. In the extreme setting, where the total number of arms can be in the millions, we propose a practically-motivated arm hierarchy model that induces a certain structure in mean rewards to ensure statistical and computational efficiency. The hierarchical structure allows for an exponential reduction in the number of relevant arms for each context, thus resulting in a regret guarantee of $O(k\sqrt{(\log A-k+1)T \log (|\mathcal{F}|T)})$. Finally, we implement our algorithm using a hierarchical linear function class and show superior performance with respect to well-known benchmarks on simulated bandit feedback experiments using extreme multi-label classification datasets. On a dataset with three million arms, our reduction scheme has an average inference time of only 7.9 milliseconds, which is a 100x improvement.
Deep Learning Prerequisites: Logistic Regression in Python
Data science techniques for professionals and students - learn the theory behind logistic regression and code in Python Created by Lazy Programmer Inc. English [Auto-generated], Portuguese [Auto-generated], 1 more Created by Lazy Programmer Inc. English [Auto-generated], Portuguese [Auto-generated], 1 more Created by Lazy Programmer Inc.
The Predictive Normalized Maximum Likelihood for Over-parameterized Linear Regression with Norm Constraint: Regret and Double Descent
A fundamental tenet of learning theory is that a trade-off exists between the complexity of a prediction rule and its ability to generalize. The double-decent phenomenon shows that modern machine learning models do not obey this paradigm: beyond the interpolation limit, the test error declines as model complexity increases. We investigate over-parameterization in linear regression using the recently proposed predictive normalized maximum likelihood (pNML) learner which is the min-max regret solution for individual data. We derive an upper bound of its regret and show that if the test sample lies mostly in a subspace spanned by the eigenvectors associated with the large eigenvalues of the empirical correlation matrix of the training data, the model generalizes despite its over-parameterized nature. We demonstrate the use of the pNML regret as a point-wise learnability measure on synthetic data and that it can successfully predict the double-decent phenomenon using the UCI dataset.
Sparse Bayesian Causal Forests for Heterogeneous Treatment Effects Estimation
Caron, Alberto, Baio, Gianluca, Manolopoulou, Ioanna
This paper develops a sparsity-inducing version of Bayesian Causal Forests, a recently proposed nonparametric causal regression model that employs Bayesian Additive Regression Trees and is specifically designed to estimate heterogeneous treatment effects using observational data. The sparsity-inducing component we introduce is motivated by empirical studies where the number of pre-treatment covariates available is non-negligible, leading to different degrees of sparsity underlying the surfaces of interest in the estimation of individual treatment effects. The extended version presented in this work, which we name Sparse Bayesian Causal Forest, is equipped with an additional pair of priors allowing the model to adjust the weight of each covariate through the corresponding number of splits in the tree ensemble. These priors improve the model's adaptability to sparse settings and allow to perform fully Bayesian variable selection in a framework for treatment effects estimation, and thus to uncover the moderating factors driving heterogeneity. In addition, the method allows prior knowledge about the relevant confounding pre-treatment covariates and the relative magnitude of their impact on the outcome to be incorporated in the model. We illustrate the performance of our method in simulated studies, in comparison to Bayesian Causal Forest and other state-of-the-art models, to demonstrate how it scales up with an increasing number of covariates and how it handles strongly confounded scenarios. Finally, we also provide an example of application using real-world data.