Goto

Collaborating Authors

 Regression


Robust Linear Regression: Optimal Rates in Polynomial Time

arXiv.org Machine Learning

We obtain a robust and computationally efficient estimator for Linear Regression that achieves statistically optimal convergence rate under mild distributional assumptions. Concretely, we assume our data is drawn from a $k$-hypercontractive distribution and an $\epsilon$-fraction is adversarially corrupted. We then describe an estimator that converges to the optimal least-squares minimizer for the true distribution at a rate proportional to $\epsilon^{2-2/k}$, when the noise is independent of the covariates. We note that no such estimator was known prior to our work, even with access to unbounded computation. The rate we achieve is information-theoretically optimal and thus we resolve the main open question in Klivans, Kothari and Meka [COLT'18]. Our key insight is to identify an analytic condition relating the distribution over the noise and covariates that completely characterizes the rate of convergence, regardless of the noise model. In particular, we show that when the moments of the noise and covariates are negatively-correlated, we obtain the same rate as independent noise. Further, when the condition is not satisfied, we obtain a rate proportional to $\epsilon^{2-4/k}$, and again match the information-theoretic lower bound. Our central technical contribution is to algorithmically exploit independence of random variables in the "sum-of-squares" framework by formulating it as a polynomial identity.


Understanding Implicit Regularization in Over-Parameterized Nonlinear Statistical Model

arXiv.org Machine Learning

We study the implicit regularization phenomenon induced by simple optimization algorithms in over-parameterized nonlinear statistical models. Specifically, we study both vector and matrix single index models where the link function is nonlinear and unknown, the signal parameter is either a sparse vector or a low-rank symmetric matrix, and the response variable can be heavy-tailed. To gain a better understanding the role of implicit regularization in the nonlinear models without excess technicality, we assume that the distribution of the covariates is known as a priori. For both the vector and matrix settings, we construct an over-parameterized least-squares loss function by employing the score function transform and a robust truncation step designed specifically for heavy-tailed data. We propose to estimate the true parameter by applying regularization-free gradient descent to the loss function. When the initialization is close to the origin and the stepsize is sufficiently small, we prove that the obtained solution achieves minimax optimal statistical rates of convergence in both the vector and matrix cases. In particular, for the vector single index model with Gaussian covariates, our proposed estimator is shown to enjoy the oracle statistical rate. Our results capture the implicit regularization phenomenon in over-parameterized nonlinear and noisy statistical models with possibly heavy-tailed data.


Optimal Robust Linear Regression in Nearly Linear Time

arXiv.org Machine Learning

We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X,w^* \rangle + \epsilon$ (with $X \in \mathbb{R}^d$ and $\epsilon$ independent), in which an $\eta$ fraction of the samples have been adversarially corrupted. We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $\mathbb{E} [XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian. In both settings, our estimators: (a) Achieve optimal sample complexities and recovery guarantees up to log factors and (b) Run in near linear time ($\tilde{O}(nd / \eta^6)$). Prior to our work, polynomial time algorithms achieving near optimal sample complexities were only known in the setting where $X$ is Gaussian with identity covariance and $\epsilon$ is Gaussian, and no linear time estimators were known for robust linear regression in any setting. Our estimators and their analysis leverage recent developments in the construction of faster algorithms for robust mean estimation to improve runtimes, and refined concentration of measure arguments alongside Gaussian rounding techniques to improve statistical sample complexities.


Machine Learning with Scikit-learn

#artificialintelligence

This blog provides an overview of how to build a Machine Learning model with details on various aspects such as data pre-processing, splitting the training and testing data, regression/classification, and finally model evaluation. Machine Learning (ML) is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions. ML systems are trained rather than explicitly programmed. It provides efficient tools for data analysis, data pre-processing, model building, model evaluation, and much more. So in this blog we will implement various ML models with the help of Scikit learn(sk-learn), which is a simple open-source Machine Learning library.


Neural Topic Models with Survival Supervision: Jointly Predicting Time-to-Event Outcomes and Learning How Clinical Features Relate

arXiv.org Machine Learning

In time-to-event prediction problems, a standard approach to estimating an interpretable model is to use Cox proportional hazards, where features are selected based on lasso regularization or stepwise regression. However, these Cox-based models do not learn how different features relate. As an alternative, we present an interpretable neural network approach to jointly learn a survival model to predict time-to-event outcomes while simultaneously learning how features relate in terms of a topic model. In particular, we model each subject as a distribution over "topics", which are learned from clinical features as to help predict a time-to-event outcome. From a technical standpoint, we extend existing neural topic modeling approaches to also minimize a survival analysis loss function. We study the effectiveness of this approach on seven healthcare datasets on predicting time until death as well as hospital ICU length of stay, where we find that neural survival-supervised topic models achieves competitive accuracy with existing approaches while yielding interpretable clinical "topics" that explain feature relationships.


Logistic Regression

#artificialintelligence

Logistic Regression is a statistical model used to determine if an independent variable has an effect on a binary dependent variable. This means that there are only two potential outcomes given an input. For example, it may be used to determine if an email is spam, or not, using the rate of misspelled words, a common sign of spam. Other forms of regression analysis, like a linear regression, require the definition of a threshold to distinguish the binary classes (e.g. Linear regression allows for a probability to be established, but it must then be applied to a logistic regression to make the distinct classification.


Ordinal Regression with Fenton-Wilkinson Order Statistics: A Case Study of an Orienteering Race

arXiv.org Machine Learning

In sports, individuals and teams are typically interested in final rankings. Final results, such as times or distances, dictate these rankings, also known as places. Places can be further associated with ordered random variables, commonly referred to as order statistics. In this work, we introduce a simple, yet accurate order statistical ordinal regression function that predicts relay race places with changeover-times. We call this function the Fenton-Wilkinson Order Statistics model. This model is built on the following educated assumption: individual leg-times follow log-normal distributions. Moreover, our key idea is to utilize Fenton-Wilkinson approximations of changeover-times alongside an estimator for the total number of teams as in the notorious German tank problem. This original place regression function is sigmoidal and thus correctly predicts the existence of a small number of elite teams that significantly outperform the rest of the teams. Our model also describes how place increases linearly with changeover-time at the inflection point of the log-normal distribution function. With real-world data from Jukola 2019, a massive orienteering relay race, the model is shown to be highly accurate even when the size of the training set is only 5% of the whole data set. Numerical results also show that our model exhibits smaller place prediction root-mean-square-errors than linear regression, mord regression and Gaussian process regression.


Misclassification cost-sensitive ensemble learning: A unifying framework

arXiv.org Machine Learning

The task of supervised machine learning is given a set of recorded observations and their outcomes to predict the outcome of new observations. Standard classification techniques aim for the highest overall accuracy or, equivalently, for the smallest total error, and include among others support vector machines, Bayesian classifiers, logistic regression, decision tree classifiers such as CART [6] and C4.5 [38], and ensemble methods which build several classifiers and aggregate their predictions such as Bagging [4], AdaBoost [16] and Random Forests [5]. Of particular interest in certain domains are binary classifiers which deal with cases where only two classes of outcomes are considered, such as fraudulent and legitimate credit card transactions, responders and non-responders to a marketing campaign, patients with and without cancer, intrusive and authorised network access, and defaulting and repaying debtors to name a few. In most of these cases, one of the classes is a small minority and consequently traditional classifiers might classify all of its members as belonging to the majority class without any significant overall accuracy loss. The severity of this class imbalance becomes more noticeable when failing to correctly predict a minority class member is more costly than doing so with a member of the majority class, as the case often is. A remedy to the undesirable situation just described are classifiers which, instead of accuracy, take misclassification costs into account and are thus termed cost-sensitive. We illustrate this idea in the credit card fraud detection framework: accepting a fraudulent transaction as legitimate incurs a cost equal to its amount.


Early stopping and polynomial smoothing in regression with reproducing kernels

arXiv.org Machine Learning

In this paper we study the problem of early stopping for iterative learning algorithms in reproducing kernel Hilbert space (RKHS) in the nonparametric regression framework. In particular, we work with gradient descent and (iterative) kernel ridge regression algorithms. We present a data-driven rule to perform early stopping without a validation set that is based on the so-called minimum discrepancy principle. This method enjoys only one assumption on the regression function: it belongs to a reproducing kernel Hilbert space (RKHS). The proposed rule is proved to be minimax optimal over different types of kernel spaces, including finite rank and Sobolev smoothness classes. The proof is derived from the fixed-point analysis of the localized Rademacher complexities, which is a standard technique for obtaining optimal rates in the nonparametric regression literature. In addition to that, we present simulations results on artificial datasets that show comparable performance of the designed rule with respect to other stopping rules such as the one determined by V-fold cross-validation.


Coded Computing for Federated Learning at the Edge

arXiv.org Machine Learning

Federated Learning (FL) is an exciting new paradigm that enables training a global model from data generated locally at the client nodes, without moving client data to a centralized server. Performance of FL in a multi-access edge computing (MEC) network suffers from slow convergence due to heterogeneity and stochastic fluctuations in compute power and communication link qualities across clients. A recent work, Coded Federated Learning (CFL), proposes to mitigate stragglers and speed up training for linear regression tasks by assigning redundant computations at the MEC server. Coding redundancy in CFL is computed by exploiting statistical properties of compute and communication delays. We develop CodedFedL that addresses the difficult task of extending CFL to distributed non-linear regression and classification problems with multioutput labels. The key innovation of our work is to exploit distributed kernel embedding using random Fourier features that transforms the training task into distributed linear regression. We provide an analytical solution for load allocation, and demonstrate significant performance gains for CodedFedL through experiments over benchmark datasets using practical network parameters.