Regression
Intelligible Machine Learning Models for HealthCare
In machine learning, a tradeoff must often be made between accuracy and intelligibility: the most accurate models usually are not very intelligible (e.g., random forests, boosted trees, and neural nets), and the most intelligible models usually are less accurate (e.g., linear or logistic regression). This tradeoff often limits the accuracy of models that can be applied in mission-critical applications such as healthcare, where being able to understand, validate, edit, and trust a learned model is important. We have developed a learning method based on generalized additive models (GAMs) that is often as accurate as full-complexity models, but remains as intelligible as linear/logistic regression models. In the talk I'll present two case studies where these high-performance generalized additive models (GA2Ms) are applied to healthcare problems, yielding intelligible models with state-of-the-art accuracy. In the pneumonia risk prediction case study, the intelligible model uncovers surprising patterns in the data that previously had prevented complex learned models from going to clinical trial; because the model is intelligible and modular, these patterns can easily be recognized and removed.
Association Discovery and Diagnosis of Alzheimer's Disease with Bayesian Multiview Learning
Xu, Zenglin, Zhe, Shandian, Qi, Yuan, Yu, Peng
The analysis and diagnosis of Alzheimer's disease (AD) can be based on genetic variations, e.g., single nucleotide polymorphisms (SNPs) and phenotypic traits, e.g., Magnetic Resonance Imaging (MRI) features. We consider two important and related tasks: i) to select genetic and phenotypical markers for AD diagnosis and ii) to identify associations between genetic and phenotypical data. While previous studies treat these two tasks separately, they are tightly coupled because underlying associations between genetic variations and phenotypical features contain the biological basis for a disease. Here we present a new sparse Bayesian approach for joint association study and disease diagnosis. In this approach, common latent features are extracted from different data sources based on sparse projection matrices and used to predict multiple disease severity levels; in return, the disease status can guide the discovery of relationships between data sources. The sparse projection matrices not only reveal interactions between data sources but also select groups of biomarkers related to the disease. Moreover, to take advantage of the linkage disequilibrium (LD) measuring the non-random association of alleles, we incorporate a graph Laplacian type of prior in the model. To learn the model from data, we develop an efficient variational inference algorithm. Analysis on an imaging genetics dataset for the study of Alzheimer's Disease (AD) indicates that our model identifies biologically meaningful associations between genetic variations and MRI features, and achieves significantly higher accuracy for predicting ordinal AD stages than the competing methods.
L1-Regularized Least Squares for Support Recovery of High Dimensional Single Index Models with Gaussian Designs
Neykov, Matey, Liu, Jun S., Cai, Tianxi
It is known that for a certain class of single index models (SIMs) $Y = f(\boldsymbol{X}_{p \times 1}^\intercal\boldsymbol{\beta}_0, \varepsilon)$, support recovery is impossible when $\boldsymbol{X} \sim \mathcal{N}(0, \mathbb{I}_{p \times p})$ and a model complexity adjusted sample size is below a critical threshold. Recently, optimal algorithms based on Sliced Inverse Regression (SIR) were suggested. These algorithms work provably under the assumption that the design $\boldsymbol{X}$ comes from an i.i.d. Gaussian distribution. In the present paper we analyze algorithms based on covariance screening and least squares with $L_1$ penalization (i.e. LASSO) and demonstrate that they can also enjoy optimal (up to a scalar) rescaled sample size in terms of support recovery, albeit under slightly different assumptions on $f$ and $\varepsilon$ compared to the SIR based algorithms. Furthermore, we show more generally, that LASSO succeeds in recovering the signed support of $\boldsymbol{\beta}_0$ if $\boldsymbol{X} \sim \mathcal{N}(0, \boldsymbol{\Sigma})$, and the covariance $\boldsymbol{\Sigma}$ satisfies the irrepresentable condition. Our work extends existing results on the support recovery of LASSO for the linear model, to a more general class of SIMs.
Regularization: Time to penalize
Regularization is very popular in machine learning, yet many people still do not use it. One reason, I think, is the perceived complexity of the whole concept, so I thought I would make it simple for all of us. In this article I will try to explain regularization in a way that is easy to understand and easy to use. While explaining the concept, I will also give practical details on how to implement regularization in R and SAS. In very simple terms, regularization refers to preventing overfitting by explicitly controlling model complexity.
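As a minimal sketch of what "explicitly controlling model complexity" can look like, the Python snippet below fits a one-predictor, no-intercept ridge regression in closed form and shows the slope shrinking as the penalty grows. The function name `ridge_slope`, the toy data, and the penalty values are illustrative assumptions and are not taken from the article, which works in R and SAS.

```python
def ridge_slope(xs, ys, lam):
    """Slope b minimizing sum((y - b*x)^2) + lam * b^2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # lam = 0 gives ordinary least squares; larger lam pulls b toward 0.
    return sxy / (sxx + lam)


# Hypothetical toy data, roughly y = 2x plus a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_slope(xs, ys, lam), 4))
```

The penalty `lam` is the single knob that trades fit against complexity; in practice it would typically be chosen by cross-validation rather than fixed by hand.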
Logistic Regression Analysis - Welcome LogisticRegressionAnalysis.com Fast, easy guide to understanding, running, and interpreting multivariate logistic regression
The purpose of this web site is to help you understand, run, and interpret logistic regression analyses as quickly and easily as possible. Many visitors find this web site because they realize that their data does not fit the assumptions of regular linear regression (least-squares regression). Instead they realize they need to use a method specifically designed for data where the Y-variable is binary (all explained below). Other visitors are users of logistic regression and are seeking answers to a specific question. But in both cases, this web site is here to help you.
The Effect of Heteroscedasticity on Regression Trees
Regression trees are becoming increasingly popular as omnibus predicting tools and as the basis of numerous modern statistical learning ensembles. Part of their popularity is their ability to create a regression prediction without ever specifying a structure for the mean model. However, the method implicitly assumes homogeneous variance across the entire explanatory-variable space. It is unknown how the algorithm behaves when faced with heteroscedastic data. In this study, we assess the performance of the most popular regression-tree algorithm in a single-variable setting under a very simple step-function model for heteroscedasticity. We use simulation to show that the locations of splits, and hence the ability to accurately predict means, are both adversely influenced by the change in variance. We identify the pruning algorithm as the main concern, although the effects on the splitting algorithm may be meaningful in some applications.
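To make the splitting step concrete, here is a minimal Python sketch of a CART-style least-squares split search on a single predictor, together with a toy simulation whose noise standard deviation steps up at x = 0.5. The abstract does not give the study's simulation design or name the algorithm, so the split criterion, the function names `best_split` and `simulate`, and all parameter values are assumptions for illustration only.

```python
import random


def best_split(xs, ys):
    """Return (threshold, sse) of the single split minimizing within-node squared error."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total_sum = sum(y for _, y in pairs)
    total_sq = sum(y * y for _, y in pairs)
    left_sum = left_sq = 0.0
    best_thr, best_sse = None, float("inf")
    for i in range(1, n):
        y = pairs[i - 1][1]
        left_sum += y
        left_sq += y * y
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        right_sum = total_sum - left_sum
        right_sq = total_sq - left_sq
        # Sum of squared deviations from each node's own mean.
        sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if sse < best_sse:
            best_thr = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_sse = sse
    return best_thr, best_sse


def simulate(n, sd_left, sd_right, seed):
    """Constant mean of 0; noise sd steps from sd_left to sd_right at x = 0.5."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [rng.gauss(0.0, sd_left if x < 0.5 else sd_right) for x in xs]
    return xs, ys


xs, ys = simulate(200, 1.0, 5.0, seed=0)
print(best_split(xs, ys))
```

Because the true mean is flat in this toy setup, any split the search reports is driven by noise; repeating the run over many seeds and recording where the chosen thresholds fall is one simple way to probe how a variance change affects split locations.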
Local Uncertainty Sampling for Large-Scale Multi-Class Logistic Regression
Han, Lei, Yang, Ting, Zhang, Tong
A major challenge for building statistical models in the big data era is that the available data volume may exceed the computational capability. A common approach to solving this problem is to employ a subsampled dataset that can be handled by the available computational resources. In this paper, we propose a general subsampling scheme for large-scale multi-class logistic regression, and examine the variance of the resulting estimator. We show that asymptotically, the proposed method always achieves a smaller variance than that of uniform random sampling. Moreover, when the classes are conditionally imbalanced, significant improvement over uniform sampling can be achieved. Empirical performance of the proposed method is compared to other methods on both simulated and real-world datasets, and these results match and confirm our theoretical analysis.
Cut off point in logistic regression
If your event rate is around 17% and you say that at a 50% cutoff you're getting a very good classification, there's something fishy! How can a logistic model trained to fit an event rate of only 17% do better than the information the dataset contains? Unless your measure of accuracy of fit is different from misclassification! Remember, the model usually fits the remaining 83% well, so the misclassification there would be low compared to the 17%. But I'm unsure how you're getting a 50% cutoff to be more accurate in terms of misclassification, since a decrease here is going to increase it there. The best way to find the cutoff is by plotting for different values, as already suggested, but it's usually got to be around the event rate!
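A rough way to do that scan in Python, assuming you already have predicted probabilities and 0/1 outcomes with both classes present. The scores, the labels, and the function name `cutoff_metrics` below are made up for illustration; they are not from the original question.

```python
def cutoff_metrics(probs, labels, cutoff):
    """Return (misclassification, sensitivity, specificity) at one cutoff."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted = p >= cutoff
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif not predicted and y:
            fn += 1
        else:
            tn += 1
    # Assumes at least one event and one non-event, so no division by zero.
    return (fp + fn) / len(labels), tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores: 2 events out of 12 cases (about 17%).
probs = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30, 0.35, 0.45]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

for cutoff in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(cutoff, cutoff_metrics(probs, labels, cutoff))
```

Looking at sensitivity and specificity next to the raw misclassification rate makes it easier to see what a given cutoff is really trading off: with a rare event, a high cutoff can look good on misclassification simply by calling everything a non-event.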
De-biasing the Lasso: Optimal Sample Size for Gaussian Designs
Javanmard, Adel, Montanari, Andrea
Performing statistical inference in high dimensions is an outstanding challenge. A major source of difficulty is the absence of precise information on the distribution of high-dimensional estimators. Here, we consider linear regression in the high-dimensional regime $p\gg n$. In this context, we would like to perform inference on a high-dimensional parameter vector $\theta^*\in{\mathbb R}^p$. Important progress has been achieved in computing confidence intervals for single coordinates $\theta^*_i$. A key role in these new methods is played by a certain debiased estimator $\hat{\theta}^{\rm d}$ that is constructed from the Lasso. Earlier work establishes that, under suitable assumptions on the design matrix, the coordinates of $\hat{\theta}^{\rm d}$ are asymptotically Gaussian provided $\theta^*$ is $s_0$-sparse with $s_0 = o(\sqrt{n}/\log p )$. The condition $s_0 = o(\sqrt{n}/ \log p )$ is stronger than the one for consistent estimation, namely $s_0 = o(n/ \log p)$. We study Gaussian designs with known or unknown population covariance. When the covariance is known, we prove that the debiased estimator is asymptotically Gaussian under the nearly optimal condition $s_0 = o(n/ (\log p)^2)$. Note that earlier work was limited to $s_0 = o(\sqrt{n}/\log p)$ even for perfectly known covariance. The same conclusion holds if the population covariance is unknown but can be estimated sufficiently well, e.g. under the same sparsity conditions on the inverse covariance as assumed by earlier work. For intermediate regimes, we describe the trade-off between sparsity in the coefficients and in the inverse covariance of the design. We further discuss several applications of our results to high-dimensional inference. In particular, we propose a new estimator that is minimax optimal up to a factor $1+o_n(1)$ for i.i.d. Gaussian designs.
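For orientation, the debiased estimator is commonly written in the following form. The abstract itself does not spell it out, so the symbols $y$ (response vector), $X$ (the $n \times p$ design matrix), $\lambda$ (Lasso penalty level), and $M$ (the inverse population covariance when it is known, or an estimate of it otherwise) are notation assumed here for illustration.

$$
\hat{\theta}^{\rm Lasso} = \arg\min_{\theta\in{\mathbb R}^p} \left\{ \frac{1}{2n}\,\|y - X\theta\|_2^2 + \lambda\,\|\theta\|_1 \right\},
\qquad
\hat{\theta}^{\rm d} = \hat{\theta}^{\rm Lasso} + \frac{1}{n}\, M X^{\sf T}\big(y - X\hat{\theta}^{\rm Lasso}\big).
$$

The second term adds back a correction built from the Lasso residuals, which is what offsets the shrinkage bias and makes the coordinates of $\hat{\theta}^{\rm d}$ amenable to Gaussian approximation.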
Predictive Models with Supervised learning in R
The concept of statistical learning started from the method of least squares in the early 1800s, which led to the invention of the linear regression method. Most of the concepts at that time were applied to astronomy. The evolution of linear and multiple regression methods gave rise to quantitative statistical computing. Statistical computing divides most of its problems into two categories: supervised and unsupervised learning.