Regression
A Minimax Approach to Supervised Learning
Given a task of predicting Y from X, a loss function L, and a set of probability distributions Gamma on (X,Y), what is the optimal decision rule minimizing the worst-case expected loss over Gamma? In this paper, we address this question by introducing a generalization of the maximum entropy principle. Applying this principle to sets of distributions with marginal on X constrained to be the empirical marginal, we provide a minimax interpretation of the maximum likelihood problem over generalized linear models as well as some popular regularization schemes. For quadratic and logarithmic loss functions we revisit well-known linear and logistic regression models. Moreover, for the 0-1 loss we derive a classifier which we call the minimax SVM.
Hunting for Discriminatory Proxies in Linear Regression Models
Yeom, Samuel, Datta, Anupam, Fredrikson, Matt
A machine learning model may exhibit discrimination when used to make decisions involving people. One potential cause for such outcomes is that the model uses a statistical proxy for a protected demographic attribute. In this paper we formulate a definition of proxy use for the setting of linear regression and present algorithms for detecting proxies. Our definition follows recent work on proxies in classification models, and characterizes a model's constituent behavior that: 1) correlates closely with a protected random variable, and 2) is causally influential in the overall behavior of the model. We show that proxies in linear regression models can be efficiently identified by solving a second-order cone program, and further extend this result to account for situations where the use of a certain input variable is justified as a business necessity''.
Structured Learning via Logistic Regression
A successful approach to structured learning is to write the learning objective as a joint function of linear parameters and inference messages, and iterate between updates to each. This paper observes that if the inference problem is "smoothed" through the addition of entropy terms, for fixed messages, the learning objective reduces to a traditional (non-structured) logistic regression problem with respect to parameters. In these logistic regression problems, each training example has a bias term determined by the current set of messages. Based on this insight, the structured energy function can be extended from linear factors to any function class where an "oracle" exists to minimize a logistic loss. Papers published at the Neural Information Processing Systems Conference.
Efficient Sublinear-Regret Algorithms for Online Sparse Linear Regression with Limited Observation
Ito, Shinji, Hatano, Daisuke, Sumita, Hanna, Yabe, Akihiro, Fukunaga, Takuro, Kakimura, Naonori, Kawarabayashi, Ken-Ichi
Online sparse linear regression is the task of applying linear regression analysis to examples arriving sequentially subject to a resource constraint that a limited number of features of examples can be observed. Despite its importance in many practical applications, it has been recently shown that there is no polynomial-time sublinear-regret algorithm unless NP$\subseteq$BPP, and only an exponential-time sublinear-regret algorithm has been found. In this paper, we introduce mild assumptions to solve the problem. In addition, thorough experiments with publicly available data demonstrate that our algorithms outperform other known algorithms. Papers published at the Neural Information Processing Systems Conference.
Semi-Supervised Factored Logistic Regression for High-Dimensional Neuroimaging Data
Bzdok, Danilo, Eickenberg, Michael, Grisel, Olivier, Thirion, Bertrand, Varoquaux, Gael
Existing neuroimaging methods typically perform either discovery of unknown neural structure or testing of neural structure associated with mental tasks. We therefore propose to blend representation modelling and task classification into a unified statistical learning problem. A multinomial logistic regression is introduced that is constrained by factored coefficients and coupled with an autoencoder. We show that this approach yields more accurate and interpretable neural models of psychological tasks in a reference dataset, as well as better generalization to other datasets. Papers published at the Neural Information Processing Systems Conference.
Optimal Subsampling with Influence Functions
Subsampling is a common and often effective method to deal with the computational challenges of large datasets. However, for most statistical models, there is no well-motivated approach for drawing a non-uniform subsample. We show that the concept of an asymptotically linear estimator and the associated influence function leads to asymptotically optimal sampling probabilities for a wide class of popular models. This is the only tight optimality result for subsampling we are aware of as other methods only provide probabilistic error bounds or optimal rates. Furthermore, for linear regression models, which have well-studied procedures for non-uniform subsampling, we empirically show our optimal influence function based method outperforms previous approaches even when using approximations to the optimal probabilities. Papers published at the Neural Information Processing Systems Conference.
A Residual Bootstrap for High-Dimensional Regression with Near Low-Rank Designs
We study the residual bootstrap (RB) method in the context of high-dimensional linear regression. When regression coefficients are estimated via least squares, classical results show that RB consistently approximates the laws of contrasts, provided that $p\ll n$, where the design matrix is of size $n\times p$. Up to now, relatively little work has considered how additional structure in the linear model may extend the validity of RB to the setting where $p/n\asymp 1$. In this setting, we propose a version of RB that resamples residuals obtained from ridge regression. Our main structural assumption on the design matrix is that it is nearly low rank --- in the sense that its singular values decay according to a power-law profile.
Unbiased estimates for linear regression via volume sampling
Derezinski, Michal, Warmuth, Manfred K. K.
Given a full rank matrix X with more columns than rows consider the task of estimating the pseudo inverse $X $ based on the pseudo inverse of a sampled subset of columns (of size at least the number of rows). We show that this is possible if the subset of columns is chosen proportional to the squared volume spanned by the rows of the chosen submatrix (ie, volume sampling). The resulting estimator is unbiased and surprisingly the covariance of the estimator also has a closed form: It equals a specific factor times $X X { \top}$. Pseudo inverse plays an important part in solving the linear least squares problem, where we try to predict a label for each column of $X$. We assume labels are expensive and we are only given the labels for the small subset of columns we sample from $X$.
On Sparse Gaussian Chain Graph Models
McCarter, Calvin, Kim, Seyoung
In this paper, we address the problem of learning the structure of Gaussian chain graph models in a high-dimensional space. Chain graph models are generalizations of undirected and directed graphical models that contain a mixed set of directed and undirected edges. While the problem of sparse structure learning has been studied extensively for Gaussian graphical models and more recently for conditional Gaussian graphical models (CGGMs), there has been little previous work on the structure recovery of Gaussian chain graph models. We consider linear regression models and a re-parameterization of the linear regression models using CGGMs as building blocks of chain graph models. We argue that when the goal is to recover model structures, there are many advantages of using CGGMs as chain component models over linear regression models, including convexity of the optimization problem, computational efficiency, recovery of structured sparsity, and ability to leverage the model structure for semi-supervised learning.
Analytic solution and stationary phase approximation for the Bayesian lasso and elastic net
The lasso and elastic net linear regression models impose a double-exponential prior distribution on the model parameters to achieve regression shrinkage and variable selection, allowing the inference of robust models from large data sets. However, there has been limited success in deriving estimates for the full posterior distribution of regression coefficients in these models, due to a need to evaluate analytically intractable partition function integrals. Here, the Fourier transform is used to express these integrals as complex-valued oscillatory integrals over "regression frequencies". This results in an analytic expansion and stationary phase approximation for the partition functions of the Bayesian lasso and elastic net, where the non-differentiability of the double-exponential prior has so far eluded such an approach. Use of this approximation leads to highly accurate numerical estimates for the expectation values and marginal posterior distributions of the regression coefficients, and allows for Bayesian inference of much higher dimensional models than previously possible. Papers published at the Neural Information Processing Systems Conference.