Regression
Robust Regression Revisited: Acceleration and Improved Estimation Rates
We study fast algorithms for statistical regression problems under the strong contamination model, where the goal is to approximately optimize a generalized linear model (GLM) given adversarially corrupted samples. Prior works in this line of research were based on the \emph{robust gradient descent} framework of \cite{PrasadSBR20}, a first-order method using biased gradient queries, or the \emph{Sever} framework of \cite{DiakonikolasKK019}, an iterative outlier-removal method calling a stationary point finder. We present nearly-linear time algorithms for robust regression problems with improved runtime or estimation guarantees compared to the state-of-the-art. For the general case of smooth GLMs (e.g.\ logistic regression), we show that the robust gradient descent framework of \cite{PrasadSBR20} can be \emph{accelerated}, and show our algorithm extends to optimizing the Moreau envelopes of Lipschitz GLMs (e.g.\ support vector machines), answering several open questions in the literature. For the well-studied case of robust linear regression, we present an alternative approach obtaining improved estimation rates over prior nearly-linear time algorithms.
Weighted Linear Bandits for Non-Stationary Environments
We consider a stochastic linear bandit model in which the available actions correspond to arbitrary context vectors whose associated rewards follow a non-stationary linear regression model. In this setting, the unknown regression parameter is allowed to vary in time. To address this problem, we propose D-LinUCB, a novel optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget the past. This involves studying the deviations of the sequential weighted least-squares estimator under generic assumptions. As a by-product, we obtain novel deviation results that can be used beyond non-stationary environments.
Sample Complexity of Learning Mixture of Sparse Linear Regressions
In the problem of learning mixtures of linear regressions, the goal is to learn a col-lection of signal vectors from a sequence of (possibly noisy) linear measurements,where each measurement is evaluated on an unknown signal drawn uniformly fromthis collection. This setting is quite expressive and has been studied both in termsof practical applications and for the sake of establishing theoretical guarantees. Inthis paper, we consider the case where the signal vectors aresparse; this generalizesthe popular compressed sensing paradigm. We improve upon the state-of-the-artresults as follows: In the noisy case, we resolve an open question of Yin et al. (IEEETransactions on Information Theory, 2019) by showing how to handle collectionsof more than two vectors and present the first robust reconstruction algorithm, i.e.,if the signals are not perfectly sparse, we still learn a good sparse approximationof the signals. In the noiseless case, as well as in the noisy case, we show how tocircumvent the need for a restrictive assumption required in the previous work.
Causal Regularization
We argue that regularizing terms in standard regression methods not only help against overfitting finite data, but sometimes also help in getting better causal models. We first consider a multi-dimensional variable linearly influencing a target variable with some multi-dimensional unobserved common cause, where the confounding effect can be decreased by keeping the penalizing term in Ridge and Lasso regression even in the population limit. The reason is a close analogy between overfitting and confounding observed for our toy model. In the case of overfitting, we can choose regularization constants via cross validation, but here we choose the regularization constant by first estimating the strength of confounding, which yielded reasonable results for simulated and real data. Further, we show a'causal generalization bound' which states (subject to our particular model of confounding) that the error made by interpreting any non-linear regression as causal model can be bounded from above whenever functions are taken from a not too rich class.
Statistical Query Lower Bounds for List-Decodable Linear Regression
We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set T of labeled examples (x, y) \in \mathbb{R} d \times \mathbb{R} and a parameter 0 \alpha 1/2 such that an \alpha -fraction of the points in T are i.i.d. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the target regression vector. Our main result is a Statistical Query (SQ) lower bound of d {\mathrm{poly}(1/\alpha)} for this problem. Our SQ lower bound qualitatively matches the performance of previously developed algorithms, providing evidence that current upper bounds for this task are nearly best possible.
UCB-based Algorithms for Multinomial Logistic Regression Bandits
Out of the rich family of generalized linear bandits, perhaps the most well studied ones are logistic bandits that are used in problems with binary rewards: for instance, when the learner aims to maximize the profit over a user that can select one of two possible outcomes (e.g., click' vs no-click'). Despite remarkable recent progress and improved algorithms for logistic bandits, existing works do not address practical situations where the number of outcomes that can be selected by the user is larger than two (e.g., click', show me later', never show again', no click'). In this paper, we study such an extension. We use multinomial logit (MNL) to model the probability of each one of K 1\geq 2 possible outcomes ( 1 stands for the not click' outcome): we assume that for a learner's action \mathbf{x}_t, the user selects one of K 1\geq 2 outcomes, say outcome i, with a MNL probabilistic model with corresponding unknown parameter \bar{\boldsymbol{\theta}}_{\ast i} . Each outcome i is also associated with a revenue parameter \rho_i and the goal is to maximize the expected revenue.
A First-Order Algorithmic Framework for Distributionally Robust Logistic Regression
Wasserstein distance-based distributionally robust optimization (DRO) has received much attention lately due to its ability to provide a robustness interpretation of various learning models. Moreover, many of the DRO problems that arise in the learning context admits exact convex reformulations and hence can be tackled by off-the-shelf solvers. Nevertheless, the use of such solvers severely limits the applicability of DRO in large-scale learning problems, as they often rely on general purpose interior-point algorithms. On the other hand, there are very few works that attempt to develop fast iterative methods to solve these DRO problems, which typically possess complicated structures. In this paper, we take a first step towards resolving the above difficulty by developing a first-order algorithmic framework for tackling a class of Wasserstein distance-based distributionally robust logistic regression (DRLR) problem.
Efficient Truncated Linear Regression with Unknown Noise Variance
Truncated linear regression is a classical challenge in Statistics, wherein a label, y w T x \varepsilon, and its corresponding feature vector, x \in \mathbb{R} k, are only observed if the label falls in some subset S \subseteq \mathbb{R}; otherwise the existence of the pair (x, y) is hidden from observation. Linear regression with truncated observations has remained a challenge, in its general form, since the early works of [Tobin'58, Amemiya '73]. When the distribution of the error is normal with known variance, recent work of [Daskalakis et al. '19] provides computationally and statistically efficient estimators of the linear model, w . In this paper, we provide the first computationally and statistically efficient estimators for truncated linear regression when the noise variance is unknown, estimating both the linear model and the variance of the noise. Our estimator is based on an efficient implementation of Projected Stochastic Gradient Descent on the negative log-likelihood of the truncated sample.
Semi-supervised Active Linear Regression
Labeled data often comes at a high cost as it may require recruiting human labelers or running costly experiments. At the same time, in many practical scenarios, one already has access to a partially labeled, potentially biased dataset that can help with the learning task at hand. Motivated by such settings, we formally initiate a study of semi-supervised active learning'' through the frame of linear regression. In this paper, we introduce an instance dependent parameter called the reduced rank, denoted \text{R}_X, and propose an efficient algorithm with query complexity O(\text{R}_X/\epsilon) . This result directly implies improved upper bounds for two important special cases: (i) active ridge regression, and (ii) active kernel ridge regression, where the reduced-rank equates to the statistical dimension'', \textsf{sd}_\lambda and effective dimension'', d_\lambda of the problem respectively, where \lambda \ge 0 denotes the regularization parameter. Finally, we introduce a distributional version of the problem as a special case of the agnostic formulation we consider earlier; here, for every X, we prove a matching instance-wise lower bound of \Omega (\text{R}_X / \epsilon) on the query complexity of any algorithm.
First order expansion of convex regularized estimators
We consider first order expansions of convex penalized estimators in high-dimensional regression problems with random designs. Our setting includes linear regression and logistic regression as special cases. For a given penalty function h and the corresponding penalized estimator \hbeta, we construct a quantity \eta, the first order expansion of \hbeta, such that the distance between \hbeta and \eta is an order of magnitude smaller than the estimation error \ \hat{\beta} - \beta *\ . In this sense, the first order expansion \eta can be thought of as a generalization of influence functions from the mathematical statistics literature to regularized estimators in high-dimensions. Such first order expansion implies that the risk of \hat{\beta} is asymptotically the same as the risk of \eta which leads to a precise characterization of the MSE of \hbeta; this characterization takes a particularly simple form for isotropic design.