 Regression


Distributed Coordinate Descent for L1-regularized Logistic Regression

arXiv.org Machine Learning

Solving logistic regression with L1-regularization in distributed settings is an important problem. This problem arises when the training dataset is very large and cannot fit in the memory of a single machine. We present d-GLMNET, a new algorithm for solving logistic regression with L1-regularization in distributed settings. We empirically show that it is superior to distributed online learning via truncated gradient.
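
For readers unfamiliar with the objective, the following is a minimal single-machine sketch of L1-regularized logistic regression solved by proximal gradient descent (ISTA) with soft-thresholding. It is not d-GLMNET and is not distributed; the function names, the step-size rule, and the iteration count are illustrative assumptions. It only shows why the L1 penalty produces exact zeros in the weight vector.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1: shrink toward zero, clip at zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_logreg_ista(X, y, lam, n_iter=500):
    """Minimize (1/n) * sum_i log(1 + exp(-y_i * x_i.w)) + lam * ||w||_1.

    Labels y are assumed to be in {-1, +1}. Hypothetical helper, not the
    algorithm from the abstract above.
    """
    n, p = X.shape
    # The gradient of the mean logistic loss is Lipschitz with constant
    # at most ||X||_2^2 / (4n), so 1/L is a safe fixed step size.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2)
    w = np.zeros(p)
    for _ in range(n_iter):
        margins = y * (X @ w)
        # sigmoid(-m) = 1 / (1 + exp(m)), written with tanh for stability.
        sig_neg = 0.5 * (1.0 - np.tanh(margins / 2.0))
        grad = -(X.T @ (y * sig_neg)) / n
        w = soft_threshold(w - step * grad, lam * step)
    return w
```

With a sufficiently large `lam` every coordinate is thresholded to zero; lowering `lam` lets the most informative features enter the model first.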


$l_1$-regularized Outlier Isolation and Regression

arXiv.org Machine Learning

This paper proposed a new regression model called $l_1$-regularized outlier isolation and regression (LOIRE) and a fast algorithm based on block coordinate descent to solve this model. In addition, assuming outliers are gross errors following a Bernoulli process, this paper presented a Bernoulli estimate model which, in theory, should be very accurate and robust because it completely eliminates the effects of outliers. Though this Bernoulli estimate is hard to solve, it can be approximated through a process that takes LOIRE as an important intermediate step. As a result, the approximate Bernoulli estimate combines the accuracy of the Bernoulli estimate with the efficiency of LOIRE regression, a point supported by several simulations. Moreover, LOIRE can be further extended to realize robust rank factorization, which is powerful in recovering the low-rank component from massive corruptions. Extensive experimental results showed that the proposed method outperforms state-of-the-art methods like RPCA and GoDec in computation speed while achieving competitive performance.


Controlling false discoveries in high-dimensional situations: Boosting with stability selection

arXiv.org Machine Learning

Modern biotechnologies often result in high-dimensional data sets with many more variables than observations ($n \ll p$). These data sets pose new challenges to statistical analysis: Variable selection becomes one of the most important tasks in this setting. We assess the recently proposed flexible framework for variable selection called stability selection. By the use of resampling procedures, stability selection adds a finite sample error control to high-dimensional variable selection procedures such as Lasso or boosting. We consider the combination of boosting and stability selection and present results from a detailed simulation study that provides insights into the usefulness of this combination. Limitations are discussed and guidance on the specification and tuning of stability selection is given. The interpretation of the used error bounds is elaborated and insights for practical data analysis are given. The results will be used to detect differentially expressed phenotype measurements in patients with autism spectrum disorders. All methods are implemented in the freely available R package stabs.
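
The resampling idea behind stability selection can be sketched in a few lines. The example below is an illustration in Python, not the R package stabs, and it swaps in a deliberately simple base selector (top-q variables by absolute marginal correlation) in place of Lasso or boosting. The function names, defaults, and the choice of half-size subsamples are assumptions; the reported bound is the standard Meinshausen and Bühlmann bound on the expected number of false selections, q^2 / ((2 * pi_thr - 1) * p), which holds under the assumptions of their theorem.

```python
import numpy as np

def top_q_by_correlation(X, y, q):
    # Stand-in base selector (an assumption): the q variables with the
    # largest absolute marginal correlation with the response.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) + 1e-12)
    return np.argsort(score)[-q:]

def stability_selection(X, y, q, pi_thr=0.9, n_subsamples=100, seed=0):
    """Return stable variables, selection frequencies, and the error bound.

    pi_thr must lie in (0.5, 1] for the bound to be defined.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        # Draw a random subsample of half the observations.
        idx = rng.choice(n, size=n // 2, replace=False)
        counts[top_q_by_correlation(X[idx], y[idx], q)] += 1
    freq = counts / n_subsamples
    stable = np.flatnonzero(freq >= pi_thr)
    # Upper bound on the expected number of falsely selected variables.
    pfer_bound = q ** 2 / ((2 * pi_thr - 1) * p)
    return stable, freq, pfer_bound
```

Raising `pi_thr` or lowering `q` tightens the bound at the cost of selecting fewer variables, which is the tuning trade-off the abstract refers to.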


Bayesian feature selection with strongly-regularizing priors maps to the Ising Model

arXiv.org Machine Learning

Identifying small subsets of features that are relevant for prediction and/or classification tasks is a central problem in machine learning and statistics. The feature selection task is especially important, and computationally difficult, for modern datasets where the number of features can be comparable to, or even exceed, the number of samples. Here, we show that feature selection with Bayesian inference takes a universal form and reduces to calculating the magnetizations of an Ising model, under some mild conditions. Our results exploit the observation that the evidence takes a universal form for strongly-regularizing priors --- priors that have a large effect on the posterior probability even in the infinite data limit. We derive explicit expressions for feature selection for generalized linear models, a large class of statistical techniques that include linear and logistic regression. We illustrate the power of our approach by analyzing feature selection in a logistic regression-based classifier trained to distinguish between the letters B and D in the notMNIST dataset.


Predicting Rooftop Solar Adoption Using Agent-Based Modeling

AAAI Conferences

In this paper we present a novel agent-based modeling methodology to predict rooftop solar adoptions in the residential energy market. We first applied several linear regression models to estimate missing variables for non-adopters, so that attributes of non-adopters and adopters could be used to train a logistic regression model. Then, we integrated the logistic regression model along with other predictive models into a multi-agent simulation platform and validated our models by comparing the forecast of aggregate adoptions in a typical zip code area with its ground truth. This result shows that the agent-based model can reliably predict future adoptions. Finally, based on the validated agent-based model, we compared the outcome of a hypothesized seeding policy with the original incentive plan, and investigated other alternative seeding policies which could lead to more adopters.


Third Party-Owned PV Systems: Understanding Market Diffusion with Geospatial Tools

AAAI Conferences

Using geospatial methods, this paper informs the evolving field of research on the diffusion of residential Third Party Owned (TPO) PV systems by analyzing 1) the spatial distribution of TPO systems, and 2) the influence of demographics on adoption at the local level. This research is part of a multidisciplinary study into the diffusion of solar technology (SEEDS), using San Diego County as the focus area. Our findings reveal a significant clustering of TPO PV adoption in San Diego County. TPO systems reached a similarly high market share across a large area in the central county, in contrast to the installation of host-owned systems, which have been less evenly distributed across single-family households in the same area. The diffusion of TPO systems in San Diego County can be partially explained by looking at median income and percentage of people born in the US. The explanatory power of the model varies across the region.


Feedback Detection for Live Predictors

arXiv.org Machine Learning

A predictor that is deployed in a live production system may perturb the features it uses to make predictions. Such a feedback loop can occur, for example, when a model that predicts a certain type of behavior ends up causing the behavior it predicts, thus creating a self-fulfilling prophecy. In this paper we analyze predictor feedback detection as a causal inference problem, and introduce a local randomization scheme that can be used to detect non-linear feedback in real-world problems. We conduct a pilot study for our proposed methodology using a predictive system currently deployed as a part of a search engine.


Sparse principal component regression with adaptive loading

arXiv.org Machine Learning

Principal component analysis (PCA) (Jolliffe, 2002) is a fundamental statistical tool for dimensionality reduction, data processing, and visualization of multivariate data, with various applications in biology, engineering, and social science. In regression analysis, it can be useful to replace many original explanatory variables with a few principal components, which is called the principal component regression (PCR) (Massy, 1965; Jolliffe, 1982). PCR is widely used in various fields of research and many extensions of PCR have been proposed (see, e.g., Hartnett et al., 1998; Rosital et al., 2001; Reiss and Ogden, 2007; Wang and Abbott, 2008). Whereas PCR is a useful tool for analyzing multivariate data, this method may not have enough prediction accuracy if the response variable depends on the principal components with small eigenvalues. The problem arises from the two-stage procedure for PCR; a few principal components are selected with large eigenvalues, but without any relation to response variable, and then the regression model is constructed using them as new explanatory variables. In this paper, we deal with PCA and regression analysis simultaneously, and propose a one-stage procedure for PCR to address this problem. The procedure combines two loss functions; one is the ordinary regression analysis loss and the other is PCA loss with some devices proposed by Zou et al. (2006).
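
The weakness of the two-stage procedure described above can be seen in a small sketch of classical PCR. This is the standard two-stage method, not the one-stage procedure the paper proposes; the function name `pcr_fit` and the synthetic data in the usage note are assumptions made for illustration.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Classical two-stage principal component regression (hypothetical helper).

    Returns coefficients on the original variables and an intercept.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Stage 1: principal directions computed from X alone, ordered by
    # decreasing singular value; the response plays no role here.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T
    # Stage 2: ordinary least squares of y on the first k component scores.
    Z = Xc @ V_k
    gamma, *_ = np.linalg.lstsq(Z, yc, rcond=None)
    beta = V_k @ gamma
    return beta, y_mean - x_mean @ beta
```

If the response is driven by a low-variance direction, for example `X = [10 * z1, 0.5 * z2]` with `y` depending only on the second column, then `pcr_fit(X, y, k=1)` keeps only the high-variance direction and explains almost none of `y`, while `k=2` recovers it.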


A General Statistic Framework for Genome-based Disease Risk Prediction

arXiv.org Machine Learning

Advances in modern sensing and sequencing technologies generate a deluge of high-dimensional spatio-temporal physiological and next-generation sequencing (NGS) data. Physiological traits are observed either as continuous random functions, or on a dense grid, and are referred to as function-valued traits. Both physiological and NGS data are highly correlated, with an inherent order, spacing, and functional nature that are ignored by traditional summary-based univariate and multivariate regression methods designed for quantitative genetic analysis of scalar traits and common variants. To capture morphological and dynamic features of the data and utilize their dependent structure, we propose a functional linear model (FLM) in which a trait curve is modeled as a response function, the genetic variation in a genomic region or gene is modeled as a functional predictor, and the genetic effects are modeled as a function of both time and genomic position (FLMF) for genetic analysis of function-valued traits with both GWAS and NGS data. By extensive simulations, we demonstrate that the FLMF has the correct type 1 error rates and much higher power to detect association than the existing methods. The FLMF is applied to sleep data from Starr County health studies where oxygen saturation was measured over 22,670 seconds on average for 833 individuals. We found 65 genes that were significantly associated with the oxygen saturation functional trait, with P-values ranging from 2.40E-06 to 2.53E-21. The results clearly demonstrate that the FLMF substantially outperforms the traditional genetic models with scalar traits.


A Greedy Homotopy Method for Regression with Nonconvex Constraints

arXiv.org Machine Learning

Constrained least squares regression is an essential tool for high-dimensional data analysis. Given a partition $\mathcal{G}$ of input variables, this paper considers a particular class of nonconvex constraint functions that encourage the linear model to select a small number of variables from a small number of groups in $\mathcal{G}$. Such constraints are relevant in many practical applications, such as Genome-Wide Association Studies (GWAS). Motivated by the efficiency of the Lasso homotopy method, we present RepLasso, a greedy homotopy algorithm that tries to solve the induced sequence of nonconvex problems by solving a sequence of suitably adapted convex surrogate problems. We prove that in some situations RepLasso recovers the global minima of the nonconvex problem. Moreover, even if it does not recover global minima, we prove that in relevant cases it will still do no worse than the Lasso in terms of support and signed support recovery, while in practice outperforming it. We show empirically that the strategy can also be used to improve over other Lasso-style algorithms. Finally, a GWAS of ankylosing spondylitis highlights our method's practical utility.