Regression
List-Decodable Linear Regression
Karmalkar, Sushrut, Klivans, Adam R., Kothari, Pravesh K.
We give the first polynomial-time algorithm for robust regression in the list-decodable setting where an adversary can corrupt a greater than $1/2$ fraction of examples. For any $\alpha < 1$, our algorithm takes as input a sample $\{(x_i,y_i)\}_{i \leq n}$ of $n$ linear equations where $\alpha n$ of the equations satisfy $y_i = \langle x_i,\ell^*\rangle +\zeta$ for some small noise $\zeta$ and $(1-\alpha)n$ of the equations are {\em arbitrarily} chosen. It outputs a list $L$ of size $O(1/\alpha)$ - a fixed constant - that contains an $\ell$ that is close to $\ell^*$. Our algorithm succeeds whenever the inliers are chosen from a \emph{certifiably} anti-concentrated distribution $D$. In particular, this gives a $(d/\alpha)^{O(1/\alpha^8)}$ time algorithm to find a $O(1/\alpha)$ size list when the inlier distribution is standard Gaussian. For discrete product distributions that are anti-concentrated only in \emph{regular} directions, we give an algorithm that achieves similar guarantee under the promise that $\ell^*$ has all coordinates of the same magnitude. To complement our result, we prove that the anti-concentration assumption on the inliers is information-theoretically necessary. Our algorithm is based on a new framework for list-decodable learning that strengthens the `identifiability to algorithms' paradigm based on the sum-of-squares method. In an independent and concurrent work, Raghavendra and Yau also used the Sum-of-Squares method to give a similar result for list-decodable regression.
Sum-of-squares meets square loss: Fast rates for agnostic tensor completion
Foster, Dylan J., Risteski, Andrej
We study tensor completion in the agnostic setting. In the classical tensor completion problem, we receive $n$ entries of an unknown rank-$r$ tensor and wish to exactly complete the remaining entries. In agnostic tensor completion, we make no assumption on the rank of the unknown tensor, but attempt to predict unknown entries as well as the best rank-$r$ tensor. For agnostic learning of third-order tensors with the square loss, we give the first polynomial time algorithm that obtains a "fast" (i.e., $O(1/n)$-type) rate improving over the rate obtained by reduction to matrix completion. Our prediction error rate to compete with the best $d\times{}d\times{}d$ tensor of rank-$r$ is $\tilde{O}(r^{2}d^{3/2}/n)$. We also obtain an exact oracle inequality that trades off estimation and approximation error. Our algorithm is based on the degree-six sum-of-squares relaxation of the tensor nuclear norm. The key feature of our analysis is to show that a certain characterization for the subgradient of the tensor nuclear norm can be encoded in the sum-of-squares proof system. This unlocks the standard toolbox for localization of empirical processes under the square loss, and allows us to establish restricted eigenvalue-type guarantees for various tensor regression models, with tensor completion as a special case. The new analysis of the relaxation complements Barak and Moitra (2016), who gave slow rates for agnostic tensor completion, and Potechin and Steurer (2017), who gave exact recovery guarantees for the noiseless setting. Our techniques are user-friendly, and we anticipate that they will find use elsewhere.
Fair Regression: Quantitative Definitions and Reduction-based Algorithms
Agarwal, Alekh, Dudík, Miroslav, Wu, Zhiwei Steven
In this paper, we study the prediction of a real-valued target, such as a risk score or recidivism rate, while guaranteeing a quantitative notion of fairness with respect to a protected attribute such as gender or race. We call this class of problems \emph{fair regression}. We propose general schemes for fair regression under two notions of fairness: (1) statistical parity, which asks that the prediction be statistically independent of the protected attribute, and (2) bounded group loss, which asks that the prediction error restricted to any protected group remain below some pre-determined level. While we only study these two notions of fairness, our schemes are applicable to arbitrary Lipschitz-continuous losses, and so they encompass least-squares regression, logistic regression, quantile regression, and many other tasks. Our schemes only require access to standard risk minimization algorithms (such as standard classification or least-squares regression) while providing theoretical guarantees on the optimality and fairness of the obtained solutions. In addition to analyzing theoretical properties of our schemes, we empirically demonstrate their ability to uncover fairness--accuracy frontiers on several standard datasets.
Arterial incident duration prediction using a bi-level framework of extreme gradient-tree boosting
Mihaita, Adriana-Simona, Liu, Zheyuan, Cai, Chen, Rizoiu, Marian-Andrei
Abstract: Predicting traffic incident duration is a major challenge for many traffic centres around the world. Most research studies focus on predicting the incident duration on motorways rather than arterial roads, due to a high network complexity and lack of data. In this paper we propose a bi-level framework for predicting the accident duration on arterial road networks in Sydney, based on operational requirements of incident clearance target which is less than 45 minutes. Using incident baseline information, we first deploy a classification method using various ensemble tree models in order to predict whether a new incident will be cleared in less than 45min or not. If the incident was classified as short-term, then various regression models are developed for predicting the actual incident duration in minutes by incorporating various traffic flow features. After outlier removal and intensive model hyper-parameter tuning through randomized search and cross-validation, we show that the extreme gradient boost approach outperformed all models, including the gradient-boosted decision-trees by almost 53%. Finally, we perform a feature importance evaluation for incident duration prediction and show that the best prediction results are obtained when leveraging the real-time traffic flow in vicinity road sections to the reported accident location. Initial methods used to predict the incident duration were 1. Introduction Bayesian classifiers [5], discrete choice models (DCM) [6], probabilistic distribution analyses [7], and the hazard-based Traffic congestion is a major concern for many cities duration models (HBDM) [8].
EM Converges for a Mixture of Many Linear Regressions
Kwon, Jeongyeol, Caramanis, Constantine
We study the convergence of the Expectation-Maximization (EM) algorithm for mixtures of linear regressions with an arbitrary number $k$ of components. We show that as long as signal-to-noise ratio (SNR) is more than $\tilde{O}(k^2)$, well-initialized EM converges to the true regression parameters. Previous results for $k \geq 3$ have only established local convergence for the noiseless setting, i.e., where SNR is infinitely large. Our results establish a near optimal statistical error rate of $\tilde{O}(\sigma \sqrt{k^2 d/n})$ for (sample-splitting) finite-sample EM with $k$ components, where $d$ is dimension, $n$ is the number of samples, and $\sigma$ is the variance of noise. In particular, our results imply exact recovery as $\sigma \rightarrow 0$, in contrast to most previous local convergence results for EM, where the statistical error scaled with the norm of parameters. Standard moment-method approaches suffice to guarantee we are in the region where our local convergence guarantees apply.
Model-Agnostic Counterfactual Explanations for Consequential Decisions
Karimi, Amir-Hossein, Barthe, Gilles, Balle, Borja, Valera, Isabel
Predictive models are being increasingly used to support consequential decision making at the individual level in contexts such as pretrial bail and loan approval. As a result, there is increasing social and legal pressure to provide explanations that help the affected individuals not only to understand why a prediction was output, but also how to act to obtain a desired outcome. To this end, several works have proposed methods to generate counterfactual explanations. However, they are often restricted to a particular subset of models (e.g., decision trees or linear models), and cannot directly handle the mixed (numerical and nominal) nature of the features describing each individual. In this paper, we propose a model-agnostic algorithm to generate counterfactual explanations that builds on the standard theory and tools from formal verification. Specifically, our algorithm solves a sequence of satisfiability problems, where a wide variety of predictive models and distances in mixed feature spaces, as well as natural notions of plausibility and diversity, are represented as logic formulas. Our experiments on real-world data demonstrate that our approach can flexibly handle widely deployed predictive models, while providing meaningfully closer counterfactuals than existing approaches.
Recursive Estimation for Sparse Gaussian Process Regression
Schürch, Manuel, Azzimonti, Dario, Benavoli, Alessio, Zaffalon, Marco
Gaussian Processes (GPs) are powerful kernelized methods for non-parameteric regression used in many applications. However, their plain usage is limited to a few thousand of training samples due to their cubic time complexity. In order to scale GPs to larger datasets, several sparse approximations based on so-called inducing points have been proposed in the literature. The majority of previous work has focused on the batch setting, whereas in this work we focusing on the training with mini-batches. In particular, we investigate the connection between a general class of sparse inducing point GP regression methods and Bayesian recursive estimation which enables Kalman Filter and Information Filter like updating for online learning. Moreover, exploiting ideas from distributed estimation, we show how our approach can be distributed. For unknown parameters, we propose a novel approach that relies on recursively propagating the analytical gradients of the posterior over mini-batches of the data. Compared to state of the art methods, we have analytic updates for the mean and covariance of the posterior, thus reducing drastically the size of the optimization problem. We show that our method achieves faster convergence and superior performance compared to state of the art sequential Gaussian Process regression on synthetic GP as well as real-world data with up to a million of data samples.
The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial
Ghojogh, Benyamin, Crowley, Mark
In this tutorial paper, we first define mean squared error, variance, covariance, and bias of both random variables and classification/predictor models. Then, we formulate the true and generalization errors of the model for both training and validation/test instances where we make use of the Stein's Unbiased Risk Estimator (SURE). We define overfitting, underfitting, and generalization using the obtained true and generalization errors. We introduce cross validation and two well-known examples which are $K$-fold and leave-one-out cross validations. We briefly introduce generalized cross validation and then move on to regularization where we use the SURE again. We work on both $\ell_2$ and $\ell_1$ norm regularizations. Then, we show that bootstrap aggregating (bagging) reduces the variance of estimation. Boosting, specifically AdaBoost, is introduced and it is explained as both an additive model and a maximum margin model, i.e., Support Vector Machine (SVM). The upper bound on the generalization error of boosting is also provided to show why boosting prevents from overfitting. As examples of regularization, the theory of ridge and lasso regressions, weight decay, noise injection to input/weights, and early stopping are explained. Random forest, dropout, histogram of oriented gradients, and single shot multi-box detector are explained as examples of bagging in machine learning and computer vision. Finally, boosting tree and SVM models are mentioned as examples of boosting.
An Investigation of Data Poisoning Defenses for Online Learning
Wang, Yizhen, Chaudhuri, Kamalika
Machine learning is increasingly used in safety-critical applications, and hence designing machine learning algorithms in the presence of an adversary has been a topic of active research [2, 3, 4, 5, 11, 12, 13]. A style of adversary that is commonly studied is data poisoning attacks [4, 12, 15, 21] where the adversary can modify or corrupt a small fraction of training examples with the goal of forcing the trained classifier to have low classification accuracy. Such attacks have threatened many real-world applications including spam filters [23], malware detection [25], sentiment analysis [24] and collaborative filtering [15]. There has been a body of prior work on data poisoning with increasingly sophisticated attacks and defenses [4, 12, 15, 21, 22, 27, 29, 30]. However, the literature largely suffers from two main limitations. First, most work is on the batch setting - all data is provided in advance and the adversary assumes that the learner's goal is to produce an empirical minimizer of a loss. This excludes many modern machine learning algorithms, such as, stochastic gradient descent, or learning from a data stream.
Deep-gKnock: nonlinear group-feature selection with deep neural network
Feature selection is central to contemporary high-dimensional data analysis. Grouping structure among features arises naturally in various scientific problems. Many methods have been proposed to incorporate the grouping structure information into feature selection. However, these methods are normally restricted to a linear regression setting. To relax the linear constraint, we combine the deep neural networks (DNNs) with the recent Knockoffs technique, which has been successful in an individual feature selection context. We propose Deep-gKnock (Deep group-feature selection using Knockoffs) as a methodology for model interpretation and dimension reduction. Deep-gKnock performs model-free group-feature selection by controlling group-wise False Discovery Rate (gFDR). Our method improves the interpretability and reproducibility of DNNs. Experimental results on both synthetic and real data demonstrate that our method achieves superior power and accurate gFDR control compared with state-of-the-art methods.