Regression
A Bayesian Group Sparse Multi-Task Regression Model for Imaging Genetics
Greenlaw, Keelin, Szefer, Elena, Graham, Jinko, Lesperance, Mary, Nathoo, Farouk S.
Motivation: Recent advances in technology for brain imaging and high-throughput genotyping have motivated studies examining the influence of genetic variation on brain structure. Wang et al. (Bioinformatics, 2012) have developed an approach for the analysis of imaging genomic studies using penalized multi-task regression with regularization based on a novel group $l_{2,1}$-norm penalty which encourages structured sparsity at both the gene level and SNP level. While incorporating a number of useful features, the proposed method only furnishes a point estimate of the regression coefficients; techniques for conducting statistical inference are not provided. A new Bayesian method is proposed here to overcome this limitation. Results: We develop a Bayesian hierarchical modeling formulation where the posterior mode corresponds to the estimator proposed by Wang et al. (Bioinformatics, 2012), and an approach that allows for full posterior inference including the construction of interval estimates for the regression parameters. We show that the proposed hierarchical model can be expressed as a three-level Gaussian scale mixture and this representation facilitates the use of a Gibbs sampling algorithm for posterior simulation. Simulation studies demonstrate that the interval estimates obtained using our approach achieve adequate coverage probabilities that outperform those obtained from the nonparametric bootstrap. Our proposed methodology is applied to the analysis of neuroimaging and genetic data collected as part of the Alzheimer's Disease Neuroimaging Initiative (ADNI), and this analysis of the ADNI cohort demonstrates clearly the value added of incorporating interval estimation beyond only point estimation when relating SNPs to brain imaging endophenotypes.
Machine Learning Done Wrong
In engineering, there are various ways to build a key-value storage, and each design makes a different set of assumptions about the usage pattern. In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data. When dealing with small amounts of data, it's reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. But as we hit "big data", it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly. As pointed out in my previous post, there are dozens of ways to solve a given modeling problem. Each model assumes something different, and it's not obvious how to navigate and identify which assumptions are reasonable.
From both sides now: the math of linear regression
Linear regression is the most basic and the most widely used technique in machine learning; yet for all its simplicity, studying it can unlock some of the most important concepts in statistics. If you have a basic understanding of linear regression expressed as $\hat{Y} = \theta_0 + \theta_1 X$, but don't have a background in statistics and find statements like "ridge regression is equivalent to the maximum a posteriori (MAP) estimate with a zero-mean Gaussian prior" bewildering, then this post is for you. Its superficial goal is to make sense of that somewhat obtuse statement; its main objective is to explore the topic, starting from the standard formulation of linear regression, moving on to the probabilistic approach (maximum likelihood formulation) and from there to Bayesian linear regression. I'll use the $\theta$ character throughout to refer to the coefficients (weights) of a regression model, either explicitly broken out as $\theta_0$ and $\theta_1$ for intercept and slope respectively, or just $\theta$ referring to the vector of coefficients. I'll usually use the expression $\theta^T x_i$ for the prediction a model gives at $x_i$, the assumption being that a 1 has been added to the vector of values at $x_i$. In the single predictor case, we know that the least squares fit is the line that minimizes the sum of the squared distances between observed data and predicted values, i.e. it minimizes the Residual Sum of Squares (RSS): $\mathrm{RSS}(\theta) = \sum_{i} (y_i - \theta^T x_i)^2$. These residuals are pretty important in how we reason about our model.
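As a rough illustration of the two ideas named in this excerpt, the sketch below fits a line by minimizing the RSS and then computes a ridge fit in closed form. It is a minimal NumPy sketch, not the post's own code: the helper names (`add_intercept`, `rss`, `fit_least_squares`, `fit_ridge`), the synthetic data, and the penalty value `lam=10.0` are all assumptions made for the example. The ridge solution shown is the MAP estimate under a zero-mean Gaussian prior on every coefficient (intercept included) with Gaussian noise, where `lam` stands for the ratio of noise variance to prior variance.

```python
import numpy as np


def add_intercept(x):
    """Prepend a column of ones so that theta[0] plays the role of the intercept."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(x.shape[0], -1)
    return np.hstack([np.ones((x.shape[0], 1)), x])


def rss(theta, X, y):
    """Residual Sum of Squares: sum_i (y_i - theta^T x_i)^2."""
    residuals = y - X @ theta
    return float(residuals @ residuals)


def fit_least_squares(X, y):
    """Coefficients that minimize the RSS."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta


def fit_ridge(X, y, lam):
    """Minimize RSS + lam * ||theta||^2 via the closed form (X^T X + lam I)^-1 X^T y.

    For simplicity the intercept is penalized too, which matches a zero-mean
    Gaussian prior placed on the whole coefficient vector.
    """
    n_coef = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_coef), X.T @ y)


# Synthetic data (an assumption for this sketch): y = 2 + 3x + Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=50)

X = add_intercept(x)
theta_ols = fit_least_squares(X, y)
theta_ridge = fit_ridge(X, y, lam=10.0)

print("least squares:", theta_ols, "RSS:", rss(theta_ols, X, y))
print("ridge (lam=10):", theta_ridge, "RSS:", rss(theta_ridge, X, y))
```

The ridge coefficients are pulled toward zero relative to the least squares ones, at the price of a larger RSS; setting `lam=0.0` recovers the least squares fit.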
Gentlest Intro to TensorFlow #3: Matrices & Multi-feature Linear Regression - All of us are belong to machines
Summary: With concepts of single-feature linear regression, cost function, gradient descent (from Part 1), epoch, learn-rate, gradient descent variation (from Part 2) under our belt, we are ready to progress to multi-feature linear regression with TensorFlow (TF). If you are already familiar with matrices and multi-feature linear regression, skip to the end for the multi-feature TensorFlow code cheatsheet, or even skip this entire article. The premise of the previous articles was: given any house size (square meters/sqm), which is the feature, we want to predict the house price, the outcome. In reality, any prediction relies on multiple features, so we advance from single-feature to 2-feature linear regression; we chose 2 features to keep visualization and comprehension simple, but the concept generalizes to any number of features. We introduce a new feature, 'Rooms' (number of units in the house).
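The move from one feature to two can be sketched in matrix form. The example below uses plain NumPy rather than TensorFlow, so it is not the article's cheatsheet. The house data, the function names (`predict`, `cost`, `gradient_step`), and the learn-rate value are hypothetical choices for illustration. Each row of `X` holds the two features (size in sqm, rooms), `W` holds one weight per feature, and the prediction is `X @ W + b`.

```python
import numpy as np

# Hypothetical data: each row is one house, columns are [size_sqm, rooms].
X = np.array([[50.0, 1.0],
              [80.0, 2.0],
              [100.0, 3.0],
              [120.0, 3.0],
              [150.0, 4.0]])
# Hypothetical prices, one per house, kept as a column vector.
y = np.array([[150.0], [240.0], [310.0], [350.0], [440.0]])


def predict(X, W, b):
    """Multi-feature linear model: one matrix product replaces the per-feature sum."""
    return X @ W + b


def cost(X, y, W, b):
    """Mean squared error between predictions and observed outcomes."""
    err = predict(X, W, b) - y
    return float(np.mean(err ** 2))


def gradient_step(X, y, W, b, learn_rate):
    """One batch gradient descent update of the weights and the bias."""
    n = X.shape[0]
    err = predict(X, W, b) - y
    grad_W = (2.0 / n) * (X.T @ err)
    grad_b = (2.0 / n) * float(err.sum())
    return W - learn_rate * grad_W, b - learn_rate * grad_b


W = np.zeros((2, 1))  # one weight per feature
b = 0.0
initial_cost = cost(X, y, W, b)

# The features are unscaled, so the learn-rate has to be small to stay stable.
for epoch in range(200):
    W, b = gradient_step(X, y, W, b, learn_rate=1e-5)

final_cost = cost(X, y, W, b)
print("initial cost:", initial_cost, "final cost:", final_cost)
```

Adding a third feature only means adding a column to `X` and a row to `W`; the functions stay unchanged.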
Interpreting the results of linear regression - EFavDB
The full code is available as an IPython notebook on github. Assuming a multivariate normal distribution for the residuals in linear regression allows us to construct test statistics and therefore specify uncertainty in our fits. A t-test judges the explanatory power of a predictor in isolation, although the standard error that appears in the calculation of the t-statistic is a function of the other predictors in the model. On the other hand, an F-test is a global test that judges the explanatory power of all the predictors together, and we've seen that parsimony in choosing predictors can improve the quality of the overall regression. We've also seen that multicollinearity can throw off the results of individual t-tests as well as obscure the interpretation of the signs of the fitted coefficients. A symptom of multicollinearity is when none of the individual coefficients are significant but the overall F-test is significant.
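The t-test and F-test described here can be computed directly from a least squares fit. The sketch below is an assumption-laden illustration rather than the notebook's code: the function name `ols_tests`, the synthetic predictors, and the noise levels are invented for the example. It fits the same kind of model on a nearly collinear design and on an independent one, to show how the standard error behind each t-statistic depends on the other predictors.

```python
import numpy as np


def ols_tests(X, y):
    """Return coefficients, standard errors, t-statistics and the overall F-statistic.

    X must already contain a leading column of ones for the intercept.
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    rss = float(resid @ resid)
    sigma2 = rss / (n - p)                    # unbiased estimate of the residual variance
    se = np.sqrt(sigma2 * np.diag(XtX_inv))   # depends on all columns of X
    t_stats = beta / se                       # one test per coefficient
    tss = float(np.sum((y - y.mean()) ** 2))
    f_stat = ((tss - rss) / (p - 1)) / (rss / (n - p))  # all predictors jointly
    return beta, se, t_stats, f_stat


rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)

# Nearly collinear design: x2 is almost a copy of x1.
x2_col = x1 + 0.01 * rng.normal(size=n)
y_col = x1 + x2_col + rng.normal(size=n)
X_col = np.column_stack([np.ones(n), x1, x2_col])
beta_col, se_col, t_col, f_col = ols_tests(X_col, y_col)

# Independent design: x2 carries its own information.
x2_ind = rng.normal(size=n)
y_ind = x1 + x2_ind + rng.normal(size=n)
X_ind = np.column_stack([np.ones(n), x1, x2_ind])
beta_ind, se_ind, t_ind, f_ind = ols_tests(X_ind, y_ind)

print("collinear   se:", se_col, "t:", t_col, "F:", f_col)
print("independent se:", se_ind, "t:", t_ind, "F:", f_ind)
```

In the collinear design the standard errors on the two slopes are inflated, which shrinks their t-statistics, while the F-statistic still reflects that the predictors jointly explain the outcome.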
How to Scale Machine Learning Data From Scratch With Python - Machine Learning Mastery
Many machine learning algorithms expect data to be scaled consistently. There are two popular methods that you should consider when scaling your data for machine learning. In this tutorial, you will discover how you can rescale your data for machine learning. Many machine learning algorithms expect the scale of the input and even the output data to be equivalent. It can help in methods that weight inputs in order to make a prediction, such as in linear regression and logistic regression.
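The excerpt does not name the two methods. The sketch below assumes they are min-max normalization (rescaling each column to the range 0 to 1) and standardization (centering each column on zero with unit standard deviation), which are the two most common choices. The function names, the sample dataset, and the use of the sample standard deviation are assumptions for this example, written with the standard library only.

```python
from statistics import mean, stdev


def normalize(dataset):
    """Min-max rescale every column to the range [0, 1]."""
    columns = list(zip(*dataset))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        # A constant column has no spread, so map it to 0.0 instead of dividing by zero.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in dataset
    ]


def standardize(dataset):
    """Center every column on zero and scale it to unit sample standard deviation."""
    columns = list(zip(*dataset))
    means = [mean(col) for col in columns]
    stdevs = [stdev(col) for col in columns]  # sample standard deviation (n - 1)
    return [
        [(v - m) / s if s > 0 else 0.0
         for v, m, s in zip(row, means, stdevs)]
        for row in dataset
    ]


# Hypothetical dataset: three rows, two columns on different scales.
dataset = [[50.0, 30.0], [20.0, 90.0], [30.0, 50.0]]

normalized = normalize(dataset)
standardized = standardize(dataset)
print(normalized)
print(standardized)
```

In practice the column minimums, maximums, means and standard deviations would be estimated on training data and reused unchanged on new data, so that both are scaled the same way.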
Aboveground biomass mapping in French Guiana by combining remote sensing, forest inventories and environmental data
Fayad, Ibrahim, Baghdadi, Nicolas, Guitet, Stéphane, Bailly, Jean-Stéphane, Hérault, Bruno, Gond, Valéry, Hajj, Mahmoud, Minh, Dinh Ho Tong
Mapping forest aboveground biomass (AGB) has become an important task, particularly for the reporting of carbon stocks and changes. AGB can be mapped using synthetic aperture radar data (SAR) or passive optical data. However, these data are insensitive to high AGB levels (>150 Mg/ha, and >300 Mg/ha for P-band), which are commonly found in tropical forests. Studies have mapped the rough variations in AGB by combining optical and environmental data at regional and global scales. Nevertheless, these maps cannot represent local variations in AGB in tropical forests. In this paper, we hypothesize that the problem of misrepresenting local variations in AGB and AGB estimation with good precision occurs because of both methodological limits (signal saturation or dilution bias) and a lack of adequate calibration data in this range of AGB values. We test this hypothesis by developing a calibrated regression model to predict variations in high AGB values (mean >300 Mg/ha) in French Guiana by a methodological approach for spatial extrapolation with data from the optical geoscience laser altimeter system (GLAS), forest inventories, radar, optics, and environmental variables for spatial inter- and extrapolation. Given their higher point count, GLAS data allow a wider coverage of AGB values. We find that the metrics from GLAS footprints are correlated with field AGB estimations ($R^2$ = 0.54, RMSE = 48.3 Mg/ha) with no bias for high values. First, predictive models, including remote-sensing, environmental variables and spatial correlation functions, allow us to obtain "wall-to-wall" AGB maps over French Guiana with an RMSE for the in situ AGB estimates of ~51 Mg/ha and $R^2$ = 0.48 at a 1-km grid size. We conclude that a calibrated regression model based on GLAS with dependent environmental data can produce good AGB predictions even for high AGB values if the calibration data fit the AGB range.
We also demonstrate that small temporal and spatial mismatches between field data and GLAS footprints are not a problem for regional and global calibrated regression models because field data aim to predict large and deep tendencies in AGB variations from environmental gradients and do not aim to represent high but stochastic and temporally limited variations from forest dynamics. Thus, we advocate including a greater variety of data, even if less precise and shifted, to better represent high AGB values in global models and to improve the fitting of these models for high values.
Asymptotic Analysis of Objectives based on Fisher Information in Active Learning
Sourati, Jamshid, Akcakaya, Murat, Leen, Todd K., Erdogmus, Deniz, Dy, Jennifer G.
Obtaining labels can be costly and time-consuming. Active learning allows a learning algorithm to intelligently query samples to be labeled for efficient learning. Fisher information ratio (FIR) has been used as an objective for selecting queries in active learning. However, little is known about the theory behind the use of FIR for active learning. There is a gap between the underlying theory and the motivation of its usage in practice. In this paper, we attempt to fill this gap and provide a rigorous framework for analyzing existing FIR-based active learning methods. In particular, we show that FIR can be asymptotically viewed as an upper bound of the expected variance of the log-likelihood ratio. Additionally, our analysis suggests a unifying framework that not only enables us to make theoretical comparisons among the existing querying methods based on FIR, but also allows us to give insight into the development of new active learning approaches based on this objective.
Two-sample testing in non-sparse high-dimensional linear models
In analyzing high-dimensional models, sparsity of the model parameter is a common but often undesirable assumption. In this paper, we study the following two-sample testing problem: given two samples generated by two high-dimensional linear models, we aim to test whether the regression coefficients of the two linear models are identical. We propose a framework named TIERS (short for TestIng Equality of Regression Slopes), which solves the two-sample testing problem without making any assumptions on the sparsity of the regression parameters. TIERS builds a new model by convolving the two samples in such a way that the original hypothesis translates into a new moment condition. A self-normalization construction is then developed to form a moment test. We provide rigorous theory for the developed framework. Under very weak conditions of the feature covariance, we show that the accuracy of the proposed test in controlling Type I errors is robust both to the lack of sparsity in the features and to the heavy tails in the error distribution, even when the sample size is much smaller than the feature dimension. Moreover, we discuss minimax optimality and efficiency properties of the proposed test. Simulation analysis demonstrates excellent finite-sample performance of our test. In deriving the test, we also develop tools that are of independent interest. The test is built upon a novel estimator, called Auto-aDaptive Dantzig Selector (ADDS), which not only automatically chooses an appropriate scale of the error term but also incorporates prior information. To effectively approximate the critical value of the test statistic, we develop a novel high-dimensional plug-in approach that complements the recent advances in Gaussian approximation theory.