Goto

Collaborating Authors

 Regression


Truncated Linear Regression in High Dimensions

Neural Information Processing Systems

As in standard linear regression, in truncated linear regression, we are given access to observations (Ai, yi)i whose dependent variable equals yi Ai {\rm T} \cdot x * \etai, where x * is some fixed unknown vector of interest and \etai is independent noise; except we are only given an observation if its dependent variable yi lies in some "truncation set" S \subset \mathbb{R}. The goal is to recover x * under some favorable conditions on the Ai's and the noise distribution. We prove that there exists a computationally and statistically efficient method for recovering k-sparse n-dimensional vectors x * from m truncated samples, which attains an optimal \ell2 reconstruction error of O(\sqrt{(k \log n)/m}). As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, such as that which may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaption of the LASSO optimization problem that accommodates truncation.


On the Optimal Weighted \ell_2 Regularization in Overparameterized Linear Regression

Neural Information Processing Systems

Our general setup leads to a number of interesting findings. We outline precise conditions that decide the sign of the optimal setting \lambda_{\opt} for the ridge parameter \lambda and confirm the implicit \ell_2 regularization effect of overparameterization, which theoretically justifies the surprising empirical observation that \lambda_{\opt} can be \textit{negative} in the overparameterized regime. We also characterize the double descent phenomenon for principal component regression (PCR) when \vX and \vbeta_{\star} are both anisotropic. Finally, we determine the optimal weighting matrix \vSigma_w for both the ridgeless ( \lambda\to 0) and optimally regularized ( \lambda \lambda_{\opt}) case, and demonstrate the advantage of the weighted objective over standard ridge regression and PCR.


List-decodable Linear Regression

Neural Information Processing Systems

We give the first polynomial-time algorithm for robust regression in the list-decodable setting where an adversary can corrupt a greater than 1/2 fraction of examples. It outputs a list L of size O(1/\alpha) - a fixed constant - that contains an \ell that is close to \ell . Our algorithm succeeds whenever the inliers are chosen from a certifiably anti-concentrated distribution D. In particular, this gives a (d/\alpha) {O(1/\alpha 8)} time algorithm to find a O(1/\alpha) size list when the inlier distribution is a standard Gaussian. For discrete product distributions that are anti-concentrated only in regular directions, we give an algorithm that achieves similar guarantee under the promise that \ell * has all coordinates of the same magnitude. To complement our result, we prove that the anti-concentration assumption on the inliers is information-theoretically necessary. To solve the problem we introduce a new framework for list-decodable learning that strengthens the identifiability to algorithms'' paradigm based on the sum-of-squares method.


Partitioning Structure Learning for Segmented Linear Regression Trees

Neural Information Processing Systems

This paper proposes a partitioning structure learning method for segmented linear regression trees (SLRT), which assigns linear predictors over the terminal nodes. The recursive partitioning process is driven by an adaptive split selection algorithm that maximizes, at each node, a criterion function based on a conditional Kendall's τ statistic that measures the rank dependence between the regressors and the fit- ted linear residuals. Theoretical analysis shows that the split selection algorithm permits consistent identification and estimation of the unknown segments. A suffi- ciently large tree is induced by applying the split selection algorithm recursively. Then the minimal cost-complexity tree pruning procedure is applied to attain the right-sized tree, that ensures (i) the nested structure of pruned subtrees and (ii) consistent estimation to the number of segments.


A Scalable Approach for Privacy-Preserving Collaborative Machine Learning

Neural Information Processing Systems

We consider a collaborative learning scenario in which multiple data-owners wish to jointly train a logistic regression model, while keeping their individual datasets private from the other parties. We propose COPML, a fully-decentralized training framework that achieves scalability and privacy-protection simultaneously. The key idea of COPML is to securely encode the individual datasets to distribute the computation load effectively across many parties and to perform the training computations as well as the model updates in a distributed manner on the securely encoded data. We provide the privacy analysis of COPML and prove its convergence. Furthermore, we experimentally demonstrate that COPML can achieve significant speedup in training over the benchmark protocols.


Overfitting Can Be Harmless for Basis Pursuit, But Only to a Degree

Neural Information Processing Systems

Recently, there have been significant interests in studying the so-called "double-descent" of the generalization error of linear regression models under the overparameterized and overfitting regime, with the hope that such analysis may provide the first step towards understanding why overparameterized deep neural networks (DNN) still generalize well. However, to date most of these studies focused on the min L2-norm solution that overfits the data. In contrast, in this paper we study the overfitting solution that minimizes the L1-norm, which is known as Basis Pursuit (BP) in the compressed sensing literature. Gaussian features, we show that for a large range of p up to a limit that grows exponentially with the number of samples n, with high probability the model error of BP is upper bounded by a value that decreases with p. To the best of our knowledge, this is the first analytical result in the literature establishing the double-descent of overfitting BP for finite n and p. Further, our results reveal significant differences between the double-descent of BP and min L2-norm solutions. Specifically, the double-descent upper-bound of BP is independent of the signal strength, and for high SNR and sparse models the descent-floor of BP can be much lower and wider than that of min L2-norm solutions.


Parameter-free HE-friendly Logistic Regression

Neural Information Processing Systems

Privacy in machine learning has been widely recognized as an essential ethical and legal issue, because the data used for machine learning may contain sensitive information. Homomorphic encryption has recently attracted attention as a key solution to preserve privacy in machine learning applications. However, current approaches on the training of encrypted machine learning have relied heavily on hyperparameter selection, which should be avoided owing to the extreme difficulty of conducting validation on encrypted data. In this study, we propose an effective privacy-preserving logistic regression method that is free from the approximation of the sigmoid function and hyperparameter selection. In our framework, a logistic regression model can be transformed into the corresponding ridge regression for the logit function. We provide a theoretical background for our framework by suggesting a new generalization error bound on the encrypted data.


Fair regression with Wasserstein barycenters

Neural Information Processing Systems

We study the problem of learning a real-valued function that satisfies the Demographic Parity constraint. It demands the distribution of the predicted output to be independent of the sensitive attribute. We consider the case that the sensitive attribute is available for prediction. We establish a connection between fair regression and optimal transport theory, based on which we derive a close form expression for the optimal fair predictor. Specifically, we show that the distribution of this optimum is the Wasserstein barycenter of the distributions induced by the standard regression function on the sensitive groups.


Estimation and Imputation in Probabilistic Principal Component Analysis with Missing Not At Random Data

Neural Information Processing Systems

Missing Not At Random (MNAR) values where the probability of having missing data may depend on the missing value itself, are notoriously difficult to account for in analyses, although very frequent in the data. One solution to handle MNAR data is to specify a model for the missing data mechanism, which makes inference or imputation tasks more complex. Furthermore, this implies a strong \textit{a priori} on the parametric form of the distribution. However, some works have obtained guarantees on the estimation of parameters in the presence of MNAR data, without specifying the distribution of missing data \citep{mohan2018estimation, tang2003analysis}. This is very useful in practice, but is limited to simple cases such as few self-masked MNAR variables in data generated according to linear regression models.


How Data Augmentation affects Optimization for Linear Regression

Neural Information Processing Systems

Though data augmentation has rapidly emerged as a key tool for optimization in modern machine learning, a clear picture of how augmentation schedules affect optimization and interact with optimization hyperparameters such as learning rate is nascent. In the spirit of classical convex optimization and recent work on implicit bias, the present work analyzes the effect of augmentation on optimization in the simple convex setting of linear regression with MSE loss.We find joint schedules for learning rate and data augmentation scheme under which augmented gradient descent provably converges and characterize the resulting minimum. Our results apply to arbitrary augmentation schemes, revealing complex interactions between learning rates and augmentations even in the convex setting. Our approach interprets augmented (S)GD as a stochastic optimization method for a time-varying sequence of proxy losses. This gives a unified way to analyze learning rate, batch size, and augmentations ranging from additive noise to random projections.