 Regression


Learning Output Embeddings in Structured Prediction

arXiv.org Machine Learning

A powerful and flexible approach to structured prediction consists in embedding the structured objects to be predicted into a feature space of possibly infinite dimension by means of output kernels, and then solving a regression problem in this output space. A prediction in the original space is computed by solving a pre-image problem. In such an approach, the embedding, linked to the target loss, is defined prior to the learning phase. In this work, we propose to jointly learn a finite approximation of the output embedding and the regression function into the new feature space. For that purpose, we leverage a priori information on the outputs and also unexploited unsupervised output data, which are both often available in structured prediction problems. We prove that the resulting structured predictor is a consistent estimator, and derive an excess risk bound. Moreover, the novel structured prediction tool enjoys a significantly smaller computational complexity than former output kernel methods. Tested empirically on various structured prediction problems, the approach proves versatile and able to handle large datasets.


On the Optimal Weighted $\ell_2$ Regularization in Overparameterized Linear Regression

arXiv.org Machine Learning

We consider the linear model $\mathbf{y} = \mathbf{X} \mathbf{\beta}_\star + \mathbf{\epsilon}$ with $\mathbf{X}\in \mathbb{R}^{n\times p}$ in the overparameterized regime $p>n$. We estimate $\mathbf{\beta}_\star$ via generalized (weighted) ridge regression: $\hat{\mathbf{\beta}}_\lambda = \left(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{\Sigma}_w\right)^\dagger \mathbf{X}^T\mathbf{y}$, where $\mathbf{\Sigma}_w$ is the weighting matrix. Under a random design setting with general data covariance $\mathbf{\Sigma}_x$ and anisotropic prior on the true coefficients $\mathbb{E}\mathbf{\beta}_\star\mathbf{\beta}_\star^T = \mathbf{\Sigma}_\beta$, we provide an exact characterization of the prediction risk $\mathbb{E}(y-\mathbf{x}^T\hat{\mathbf{\beta}}_\lambda)^2$ in the proportional asymptotic limit $p/n\rightarrow \gamma \in (1,\infty)$. Our general setup leads to a number of interesting findings. We outline precise conditions that decide the sign of the optimal setting $\lambda_{\rm opt}$ for the ridge parameter $\lambda$ and confirm the implicit $\ell_2$ regularization effect of overparameterization, which theoretically justifies the surprising empirical observation that $\lambda_{\rm opt}$ can be negative in the overparameterized regime. We also characterize the double descent phenomenon for principal component regression (PCR) when both $\mathbf{X}$ and $\mathbf{\beta}_\star$ are anisotropic. Finally, we determine the optimal weighting matrix $\mathbf{\Sigma}_w$ for both the ridgeless ($\lambda\to 0$) and optimally regularized ($\lambda = \lambda_{\rm opt}$) case, and demonstrate the advantage of the weighted objective over standard ridge regression and PCR.
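The weighted ridge estimator in the abstract above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes $\lambda > 0$ and a positive definite $\mathbf{\Sigma}_w$, so that the pseudoinverse reduces to an ordinary inverse (the paper also covers negative and vanishing $\lambda$, which this sketch does not). The helper names `solve` and `weighted_ridge` and the toy data are made up for the example.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def weighted_ridge(X, y, sigma_w, lam):
    """Return (X^T X + lam * sigma_w)^{-1} X^T y, assuming the matrix is invertible."""
    p = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + lam * sigma_w[i][j]
             for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(gram, xty)


# Toy overparameterized problem: n = 2 samples, p = 3 features.
X = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
y = [1.0, 2.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
heavier_on_second = [[1.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0]]

beta_standard = weighted_ridge(X, y, identity, lam=1.0)
beta_weighted = weighted_ridge(X, y, heavier_on_second, lam=1.0)
```

With the identity weighting this is standard ridge regression; putting a larger weight on the second coordinate shrinks that coefficient more strongly (from 1.0 to 0.5 in this toy case).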


Evaluation Metrics for Regression Analysis

#artificialintelligence

These terms will come up, and it's good to get familiar with them if you aren't already: Goodness of fit is typically a term used to describe how well a dataset aligns with a certain statistical distribution. Here, we're going to think of it as a way of describing how well our model is fitted to our data. If we can think about our regression model in terms of the imaginary "best-fit" line it produces, then it makes sense that we would want to know how well this line matches our data. This goodness of fit can be quantified in a variety of ways, but the R² and the adjusted R² score are two of the most common methods for describing how well our model is capturing the variance in our target data. R² -- also called the coefficient of determination -- is a statistical measure representing the proportion of the variance in a dependent variable that is captured by your model's predictions.
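As a rough sketch of the two scores mentioned above, using the standard textbook formulas: R² is one minus the ratio of the residual sum of squares to the total sum of squares, and adjusted R² additionally penalizes the number of predictors. The function names and the sample values are made up for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r2_score(y_true, y_pred, n_features):
    """Adjusted R²: corrects R² for the number of predictors in the model."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


# Hypothetical targets and predictions from a one-feature model.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

r2 = r2_score(y_true, y_pred)                            # 0.975
adj = adjusted_r2_score(y_true, y_pred, n_features=1)    # 0.9625
```

Adjusted R² is never larger than R² and drops when extra predictors do not improve the fit enough to justify themselves.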


DebiNet: Debiasing Linear Models with Nonlinear Overparameterized Neural Networks

arXiv.org Machine Learning

Recent years have witnessed strong empirical performance of over-parameterized neural networks on various tasks and many advances in the theory, e.g. universal approximation and provable convergence to a global minimum. In this paper, we incorporate over-parameterized neural networks into semi-parametric models to bridge the gap between inference and prediction, especially in the high dimensional linear problem. By doing so, we can exploit a wide class of networks to approximate the nuisance functions and to estimate the parameters of interest consistently. Therefore, we may offer the best of two worlds: the universal approximation ability of neural networks and the interpretability of the classic ordinary linear model, leading to valid inference and accurate prediction. We show the theoretical foundations that make this possible and demonstrate them with numerical experiments. Furthermore, we propose a framework, DebiNet, in which we plug arbitrary feature selection methods into our semi-parametric neural network, and illustrate that our framework debiases the regularized estimators and performs well in terms of post-selection inference and generalization error.


Learning Deep Features in Instrumental Variable Regression

arXiv.org Machine Learning

Instrumental variable (IV) regression is a standard strategy for learning causal relationships between confounded treatment and outcome variables from observational data by utilizing an instrumental variable, which affects the outcome only through the treatment. In classical IV regression, learning proceeds in two stages: stage 1 performs linear regression from the instrument to the treatment; and stage 2 performs linear regression from the treatment to the outcome, conditioned on the instrument. We propose a novel method, deep feature instrumental variable regression (DFIV), to address the case where relations between instruments, treatments, and outcomes may be nonlinear. In this case, deep neural nets are trained to define informative nonlinear features on the instruments and treatments. We propose an alternating training regime for these features to ensure good end-to-end performance when composing stages 1 and 2, thus obtaining highly flexible feature maps in a computationally efficient manner. DFIV outperforms recent state-of-the-art methods on challenging IV benchmarks, including settings involving high dimensional image data. DFIV also exhibits competitive performance in off-policy policy evaluation for reinforcement learning, which can be understood as an IV regression task.
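The classical two-stage procedure described above can be illustrated with the textbook two-stage least squares estimator for a single instrument and a single treatment. This is only a sketch of the linear baseline, not of DFIV; the function names and the toy data are made up, and the confounder `u` is constructed to be exactly uncorrelated with the instrument `z` so that the result is easy to check by hand.

```python
def ols_slope_intercept(x, y):
    """Simple least-squares fit y ≈ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_stage_least_squares(z, x, y):
    """Stage 1: regress treatment x on instrument z.
    Stage 2: regress outcome y on the stage-1 fitted treatment."""
    a, b = ols_slope_intercept(z, x)
    x_hat = [a * zi + b for zi in z]
    return ols_slope_intercept(x_hat, y)


# Toy data: the true causal effect of x on y is 2.
z = [-1.0, -1.0, 1.0, 1.0]                         # instrument
u = [-1.0, 1.0, -1.0, 1.0]                         # unobserved confounder
x = [zi + ui for zi, ui in zip(z, u)]              # treatment
y = [2.0 * xi + 3.0 * ui for xi, ui in zip(x, u)]  # outcome

naive_slope, _ = ols_slope_intercept(x, y)         # 3.5, biased by the confounder
iv_slope, _ = two_stage_least_squares(z, x, y)     # 2.0, the causal effect
```

Regressing the outcome directly on the treatment absorbs the confounder's influence, while the second stage uses only the variation in the treatment that is explained by the instrument.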


Your Data Science Toolbox -- What is Inside?

#artificialintelligence

Data science is a very broad multi-disciplinary field that includes several subdivisions such as data visualization, machine learning, and artificial intelligence. Due to the broadness of the field and because data science is constantly changing due to technological innovations and the development of new algorithms, a successful data scientist has to maintain a big and updated toolbox at all times. Keep in mind that as a data scientist, you can only perform tasks that you have the right tools for. This article will discuss several tools that one can include in their data science toolbox. Knowledge-based tools can be grouped into three main categories based on the level of data science tasks involved: level 1 (basic level); level 2 (intermediate level); and level 3 (advanced level). Basic tools are tools that would enable one to perform level 1 tasks.


Error-Correcting Output Codes (ECOC) for Machine Learning

#artificialintelligence

Some machine learning algorithms, such as logistic regression and support vector machines, are designed for two-class (binary) classification problems. As such, these algorithms must either be modified for multi-class (more than two classes) classification problems or not used at all. The Error-Correcting Output Codes method is a technique that reframes a multi-class classification problem as multiple binary classification problems, so that native binary classification models can be used directly. Unlike one-vs-rest and one-vs-one methods, which offer a similar solution by dividing a multi-class classification problem into a fixed number of binary classification problems, the error-correcting output codes technique allows each class to be encoded as an arbitrary number of binary classification problems. When an overdetermined representation is used, the extra models act as "error-correction" predictions that can result in better predictive performance.
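The encoding and decoding idea can be sketched without any particular classifier. In the example below, each of four classes gets a 7-bit codeword, one binary model would be trained per bit position, and a prediction is decoded to the class whose codeword is closest in Hamming distance. The codebook, the function names, and the "noisy" prediction are all made up for illustration; the bits stand in for the outputs of trained binary models.

```python
from itertools import combinations

# Hypothetical 7-bit codebook for 4 classes. Every pair of codewords
# differs in 4 positions, so any single wrong bit can be corrected.
CODEBOOK = {
    0: (1, 1, 1, 1, 1, 1, 1),
    1: (0, 0, 0, 0, 1, 1, 1),
    2: (0, 0, 1, 1, 0, 0, 1),
    3: (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a, b):
    """Number of positions at which two bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def binary_targets(labels, bit):
    """Training targets for the binary model assigned to one bit position."""
    return [CODEBOOK[label][bit] for label in labels]


def decode(bits):
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(CODEBOOK, key=lambda label: hamming(CODEBOOK[label], bits))


min_distance = min(hamming(a, b) for a, b in combinations(CODEBOOK.values(), 2))

# Codeword of class 2 with the first bit flipped, as if one model erred.
noisy = (1, 0, 1, 1, 0, 0, 1)
predicted = decode(noisy)  # still decodes to class 2
```

Four classes need only two bits to be told apart, so the five extra bits are the redundancy that lets the decoder absorb a mistake by one of the binary models.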


On Optimality of Meta-Learning in Fixed-Design Regression with Weighted Biased Regularization

arXiv.org Machine Learning

We consider fixed-design linear regression in the meta-learning model of Baxter (2000) and establish a problem-dependent finite-sample lower bound on the transfer risk (risk on a newly observed task) valid for all estimators. Moreover, we prove that a weighted form of biased regularization, a popular technique in transfer and meta-learning, is optimal, i.e. it enjoys a problem-dependent upper bound on the risk matching our lower bound up to a constant. Thus, our bounds characterize meta-learning linear regression problems and reveal a fine-grained dependency on the task structure. Our characterization suggests that in the non-asymptotic regime, for a sufficiently large number of tasks, meta-learning can be considerably superior to single-task learning. Finally, we propose a practical adaptation of the optimal estimator through an Expectation-Maximization procedure and show its effectiveness in a series of experiments.


Strongly universally consistent nonparametric regression and classification with privatised data

arXiv.org Machine Learning

In this paper we revisit the classical problem of nonparametric regression, but impose local differential privacy constraints. Under such constraints, the raw data $(X_1,Y_1),\ldots,(X_n,Y_n)$, taking values in $\mathbb{R}^d \times \mathbb{R}$, cannot be directly observed, and all estimators are functions of the randomised output from a suitable privacy mechanism. The statistician is free to choose the form of the privacy mechanism, and here we add Laplace distributed noise to a discretisation of the location of a feature vector $X_i$ and to the value of its response variable $Y_i$. Based on this randomised data, we design a novel estimator of the regression function, which can be viewed as a privatised version of the well-studied partitioning regression estimator. The main result is that the estimator is strongly universally consistent. Our methods and analysis also give rise to a strongly universally consistent binary classification rule for locally differentially private data.
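To make the mechanism above more concrete, here is a loose one-dimensional sketch: each data holder releases Laplace-perturbed cell indicators and Laplace-perturbed cell-wise responses, and the statistician forms a ratio of noisy sums per cell. This is an assumption-laden illustration rather than the paper's estimator: the partition of [0, 1) into equal cells, the fixed noise `scale`, and all function names are made up, and calibrating the noise to a privacy budget as well as any truncation or thresholding steps are left out.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatise(x, y, n_cells, scale, rng):
    """Noisy cell indicators and noisy cell-wise responses for one sample, x in [0, 1)."""
    cell = min(int(x * n_cells), n_cells - 1)
    w = [(1.0 if j == cell else 0.0) + laplace_noise(scale, rng) for j in range(n_cells)]
    z = [(y if j == cell else 0.0) + laplace_noise(scale, rng) for j in range(n_cells)]
    return w, z


def partition_estimate(private_data, n_cells):
    """Per-cell regression estimate: noisy response sum over noisy count.
    Assumes enough samples that the noisy counts are well away from zero."""
    counts = [0.0] * n_cells
    sums = [0.0] * n_cells
    for w, z in private_data:
        for j in range(n_cells):
            counts[j] += w[j]
            sums[j] += z[j]
    return [s / c for s, c in zip(sums, counts)]


rng = random.Random(0)
n_cells = 4
private_data = []
for _ in range(20000):
    x = rng.random()
    y = 2.0 * x  # hypothetical regression function m(x) = 2x
    private_data.append(privatise(x, y, n_cells, scale=1.0, rng=rng))

estimates = partition_estimate(private_data, n_cells)
cell_means = [0.25, 0.75, 1.25, 1.75]  # average of 2x over each cell
```

Only the perturbed vectors ever leave the data holder, so the raw pairs are never observed; the added noise averages out across samples, which is why the per-cell ratios approach the cell-wise means as the sample grows.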


Estimating NBA players salary share according to their performance on court: A machine learning approach

arXiv.org Machine Learning

The relationship between professional athletes' on-field performance and their salaries is a topic that has attracted the interest of numerous researchers (Garris and Wilkes, 2017; Olbrecht, 2009; Vincent and Eastman, 2009; Wiseman and Chatterjee, 2010; Yilmaz and Chatterjee, 2003; Zimmer and Zimmer, 2001). The general question of interest is whether players deserve their salaries based on their performance statistics. We emphasize that this relationship is not linear, and hence linear models are bound to fail in capturing the true underlying association. An additional concern, separate from non-linearity, is the model's predictive performance, for which internal evaluation has limitations and leads to over-optimistic estimates. These and other matters, discussed later, require delicate treatment and, if not properly addressed, will yield erroneous results.