Regression
Inference in High Dimensions with the Penalized Score Test
Voorman, Arend, Shojaie, Ali, Witten, Daniela
In recent years, there has been considerable theoretical development regarding variable selection consistency of penalized regression techniques, such as the lasso. However, there has been relatively little work on quantifying the uncertainty in these selection procedures. In this paper, we propose a new method for inference in high dimensions using a score test based on penalized regression. In this test, we perform penalized regression of an outcome on all but a single feature, and test for correlation of the residuals with the held-out feature. This procedure is applied to each feature in turn. Interestingly, when an $\ell_1$ penalty is used, the sparsity pattern of the lasso corresponds exactly to a decision based on the proposed test. Further, when an $\ell_2$ penalty is used, the test corresponds precisely to a score test in a mixed effects model, in which the effects of all but one feature are assumed to be random. We formulate the hypothesis being tested as a compromise between the null hypotheses tested in simple linear regression on each feature and in multiple linear regression on all features, and develop reference distributions for some well-known penalties. We also examine the behavior of the test on real and simulated data.
Extreme Logistic Regression: A Large Scale Learning Algorithm with Application to Prostate Cancer Mortality Prediction
Ngufor, Che (George Mason University) | Wojtusiak, Janusz (George Mason University) | Hooker, Andrea (George Mason University) | Oz, Talha (George Mason University) | Hadley, Jack (George Mason University)
With the recent popularity of electronic medical records, enormous amount of medical data is being generated every day at an exponential rate.Machine learning methods have been shown in many studies to be capable of producing automatic medical diagnostic models such as automated prognostic models. However, many powerful machine learning algorithms such as support vector machine (SVM), Random Forest (RF) or Kernel Logistic Regression (KLR) are unbearably slow for very large datasets. This makes their use in medical research limited to small to medium scale problems.This study is motivated by an ongoing research on prostate cancer mortality prediction for a national representative of US population where the SVM and RF took several hours or days to trainwhereas simple linear methods such as logistic regression or linear discriminant analysis take minutes or even seconds.Because, most real-world problems are non-linear, this paper presents a large scale algorithm enabling a recently proposed least squares extreme logistic regression to learn very large datasets. The algorithm is shown on a case study of mortality prediction for men diagnosed with early stage prostate cancer to provide very fast and more accurate result than standard statistical methods.
Automatic Construction and Natural-Language Description of Nonparametric Regression Models
Lloyd, James Robert, Duvenaud, David, Grosse, Roger, Tenenbaum, Joshua B., Ghahramani, Zoubin
This paper presents the beginnings of an automatic statistician, focusing on regression problems. Our system explores an open-ended space of statistical models to discover a good explanation of a data set, and then produces a detailed report with figures and natural-language text. Our approach treats unknown regression functions nonparametrically using Gaussian processes, which has two important consequences. First, Gaussian processes can model functions in terms of high-level properties (e.g. smoothness, trends, periodicity, changepoints). Taken together with the compositional structure of our language of models this allows us to automatically describe functions in simple terms. Second, the use of flexible nonparametric models and a rich language for composing them in an open-ended manner also results in state-of-the-art extrapolation performance evaluated over 13 real time series data sets from various domains.
Learning optimization models in the presence of unknown relations
Verwer, Sicco, Zhang, Yingqian, Ye, Qing Chuan
In a sequential auction with multiple bidding agents, it is highly challenging to determine the ordering of the items to sell in order to maximize the revenue due to the fact that the autonomy and private information of the agents heavily influence the outcome of the auction. The main contribution of this paper is two-fold. First, we demonstrate how to apply machine learning techniques to solve the optimal ordering problem in sequential auctions. We learn regression models from historical auctions, which are subsequently used to predict the expected value of orderings for new auctions. Given the learned models, we propose two types of optimization methods: a black-box best-first search approach, and a novel white-box approach that maps learned models to integer linear programs (ILP) which can then be solved by any ILP-solver. Although the studied auction design problem is hard, our proposed optimization methods obtain good orderings with high revenues. Our second main contribution is the insight that the internal structure of regression models can be efficiently evaluated inside an ILP solver for optimization purposes. To this end, we provide efficient encodings of regression trees and linear regression models as ILP constraints. This new way of using learned models for optimization is promising. As the experimental results show, it significantly outperforms the black-box best-first search in nearly all settings.
Estimating nonlinear regression errors without doing regression
A method for estimating nonlinear regression errors and their distributions without performing regression is presented. Assuming continuity of the modeling function the variance is given in terms of conditional probabilities extracted from the data. For N data points the computational demand is N2. Comparing the predicted residual errors with those derived from a linear model assumption provides a signal for nonlinearity. The method is successfully illustrated with data generated by the Ikeda and Lorenz maps augmented with noise. As a by-product the embedding dimensions of these maps are also extracted.
A Permutation Approach for Selecting the Penalty Parameter in Penalized Model Selection
Sabourin, Jeremy, Valdar, William, Nobel, Andrew
The analysis of high dimensional data, in which the number of measured predictors is large and can exceed the number of samples, is an important and common problem in statistical applications. When samples are accompanied by a real or categorical response, data analysis typically includes model fitting with the aim of doing prediction or variable selection, or both. The goal of prediction is to derive a rule capable of accurately predicting the response of a new, unlabeled sample. The goal of variable selection is to select a (small) subset of the measured predictors whose individual or coordinated activity is significantly related to the response. In both cases, it is common to assume that the observed data arise from an underlying model that is sparse, in the sense that only a small subset of the predictors are related to the response. Whether sparsity is assumed, or viewed as a desirable feature of a model, analysis of high dimensional data is often carried out by penalized methods that produce models in which a relatively small subset of the available predictors are included. Popular penalized methods include the LASSO (Tibshirani, 1996), its numerous variations, and SCAD (Fan and Li, 2001). In what follows, we focus our attention on the LASSO. The LASSO and its variants require specification of a penalty/tuning parameter that controls the tradeoff between model fit and model size.
Lossy Compression via Sparse Linear Regression: Computationally Efficient Encoding and Decoding
Venkataramanan, Ramji, Sarkar, Tuhin, Tatikonda, Sekhar
We propose computationally efficient encoders and decoders for lossy compression using a Sparse Regression Code. The codebook is defined by a design matrix and codewords are structured linear combinations of columns of this matrix. The proposed encoding algorithm sequentially chooses columns of the design matrix to successively approximate the source sequence. It is shown to achieve the optimal distortion-rate function for i.i.d Gaussian sources under the squared-error distortion criterion. For a given rate, the parameters of the design matrix can be varied to trade off distortion performance with encoding complexity. An example of such a trade-off as a function of the block length n is the following. With computational resource (space or time) per source sample of O((n/\log n)^2), for a fixed distortion-level above the Gaussian distortion-rate function, the probability of excess distortion decays exponentially in n. The Sparse Regression Code is robust in the following sense: for any ergodic source, the proposed encoder achieves the optimal distortion-rate function of an i.i.d Gaussian source with the same variance. Simulations show that the encoder has good empirical performance, especially at low and moderate rates.
Systematic Ensemble Learning for Regression
Aldave, Roberto, Dussault, Jean-Pierre
The motivation of this work is to improve the performance of standard stacking approaches or ensembles, which are composed of simple, heterogeneous base models, through the integration of the generation and selection stages for regression problems. We propose two extensions to the standard stacking approach. In the first extension we combine a set of standard stacking approaches into an ensemble of ensembles using a two-step ensemble learning in the regression setting. The second extension consists of two parts. In the initial part a diversity mechanism is injected into the original training data set, systematically generating different training subsets or partitions, and corresponding ensembles of ensembles. In the final part after measuring the quality of the different partitions or ensembles, a max-min rule-based selection algorithm is used to select the most appropriate ensemble/partition on which to make the final prediction. We show, based on experiments over a broad range of data sets, that the second extension performs better than the best of the standard stacking approaches, and is as good as the oracle of databases, which has the best base model selected by cross-validation for each data set. In addition to that, the second extension performs better than two state-of-the-art ensemble methods for regression, and it is as good as a third state-of-the-art ensemble method.
Can Cascades be Predicted?
Cheng, Justin, Adamic, Lada A., Dow, P. Alex, Kleinberg, Jon, Leskovec, Jure
On many social networking web sites such as Facebook and Twitter, resharing or reposting functionality allows users to share others' content with their own friends or followers. As content is reshared from user to user, large cascades of reshares can form. While a growing body of research has focused on analyzing and characterizing such cascades, a recent, parallel line of work has argued that the future trajectory of a cascade may be inherently unpredictable. In this work, we develop a framework for addressing cascade prediction problems. On a large sample of photo reshare cascades on Facebook, we find strong performance in predicting whether a cascade will continue to grow in the future. We find that the relative growth of a cascade becomes more predictable as we observe more of its reshares, that temporal and structural features are key predictors of cascade size, and that initially, breadth, rather than depth in a cascade is a better indicator of larger cascades. This prediction performance is robust in the sense that multiple distinct classes of features all achieve similar performance. We also discover that temporal features are predictive of a cascade's eventual shape. Observing independent cascades of the same content, we find that while these cascades differ greatly in size, we are still able to predict which ends up the largest.
Discriminative Functional Connectivity Measures for Brain Decoding
Firat, Orhan, Ozay, Mete, Oztekin, Ilke, Vural, Fatos T. Yarman
We propose a statistical learning model for classifying cognitive processes based on distributed patterns of neural activation in the brain, acquired via functional magnetic resonance imaging (fMRI). In the proposed learning method, local meshes are formed around each voxel. The distance between voxels in the mesh is determined by using a functional neighbourhood concept. In order to define the functional neighbourhood, the similarities between the time series recorded for voxels are measured and functional connectivity matrices are constructed. Then, the local mesh for each voxel is formed by including the functionally closest neighbouring voxels in the mesh. The relationship between the voxels within a mesh is estimated by using a linear regression model. These relationship vectors, called Functional Connectivity aware Local Relational Features (FC-LRF) are then used to train a statistical learning machine. The proposed method was tested on a recognition memory experiment, including data pertaining to encoding and retrieval of words belonging to ten different semantic categories. Two popular classifiers, namely k-nearest neighbour (k-nn) and Support Vector Machine (SVM), are trained in order to predict the semantic category of the item being retrieved, based on activation patterns during encoding. The classification performance of the Functional Mesh Learning model, which range in 62%-71% is superior to the classical multi-voxel pattern analysis (MVPA) methods, which range in 40%-48%, for ten semantic categories.