Regression
Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
Cawley, Gavin C., Talbot, Nicola L., Girolami, Mark
Multinomial logistic regression provides the standard penalised maximumlikelihood solution to multi-class pattern recognition problems. More recently, the development of sparse multinomial logistic regression models has found application in text processing and microarray classification, where explicit identification of the most informative features is of value. In this paper, we propose a sparse multinomial logistic regression method, in which the sparsity arises from the use of a Laplace prior, but where the usual regularisation parameter is integrated out analytically. Evaluation over a range of benchmark datasets reveals this approach results in similar generalisation performance to that obtained using cross-validation, but at greatly reduced computational expense.
Mixture Regression for Covariate Shift
Sugiyama, Masashi, Storkey, Amos J.
In supervised learning there is a typical presumption that the training and test points are taken from the same distribution. In practice this assumption is commonly violated.The situations where the training and test data are from different distributions is called covariate shift. Recent work has examined techniques for dealing with covariate shift in terms of minimisation of generalisation error. As yet the literature lacks a Bayesian generative perspective on this problem. This paper tackles this issue for regression models. Recent work on covariate shift can be understood in terms of mixture regression. Using this view, we obtain a general approach to regression under covariate shift, which reproduces previous work as a special case. The main advantages of this new formulation over previous models forcovariate shift are that we no longer need to presume the test and training densities are known, the regression and density estimation are combined into a single procedure, and previous methods are reproduced as special cases of this procedure, shedding light on the implicit assumptions the methods are making.
Speakers optimize information density through syntactic reduction
If language users are rational, they might choose to structure their utterances so as to optimize communicative properties. In particular, information-theoretic and psycholinguistic considerations suggest that this may include maximizing the uniformity ofinformation density in an utterance. We investigate this possibility in the context of syntactic reduction, where the speaker has the option of either marking a higher-order unit (a phrase) with an extra word, or leaving it unmarked. We demonstrate that speakers are more likely to reduce less information-dense phrases. In a second step, we combine a stochastic model of structured utterance production with a logistic-regression model of syntactic reduction to study which types of cues speakers employ when estimating the predictability of upcoming elements. We demonstrate that the trend toward predictability-sensitive syntactic reduction (Jaeger, 2006) is robust in the face of a wide variety of control variables, andpresent evidence that speakers use both surface and structural cues for predictability estimation.
Map-Reduce for Machine Learning on Multicore
Chu, Cheng-tao, Kim, Sang K., Lin, Yi-an, Yu, Yuanyuan, Bradski, Gary, Olukotun, Kunle, Ng, Andrew Y.
We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallelprogramming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain "summation form," which allows them to be easily parallelized onmulticore computers. We adapt Google's map-reduce [7] paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR),naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors.
Kernels and Ensembles: Perspectives on Statistical Learning
Since their emergence in the 1990's, the support vector machine and the AdaBoost algorithm have spawned a wave of research in statistical machine learning. Much of this new research falls into one of two broad categories: kernel methods and ensemble methods. In this expository article, I discuss the main ideas behind these two types of methods, namely how to transform linear algorithms into nonlinear ones by using kernel functions, and how to make predictions with an ensemble or a collection of models rather than a single model. I also share my personal perspectives on how these ideas have influenced and shaped my own research. In particular, I present two recent algorithms that I have invented with my collaborators: LAGO, a fast kernel algorithm for unbalanced classification and rare target detection; and Darwinian evolution in parallel universes, an ensemble method for variable selection.
Autoencoder, Principal Component Analysis and Support Vector Regression for Data Imputation
Marivate, Vukosi N., Nelwamodo, Fulufhelo V., Marwala, Tshilidzi
Data collection often results in records that have missing values or variables. This investigation compares 3 different data imputation models and identifies their merits by using accuracy measures. Autoencoder Neural Networks, Principal components and Support Vector regression are used for prediction and combined with a genetic algorithm to then impute missing variables. The use of PCA improves the overall performance of the autoencoder network while the use of support vector regression shows promising potential for future investigation. Accuracies of up to 97.4 % on imputation of some of the variables were achieved.
Piecewise linear regularized solution paths
We consider the generic regularized optimization problem $\hat{\mathsf{\beta}}(\lambda)=\arg \min_{\beta}L({\sf{y}},X{\sf{\beta}})+\lambda J({\sf{\beta}})$. Efron, Hastie, Johnstone and Tibshirani [Ann. Statist. 32 (2004) 407--499] have shown that for the LASSO--that is, if $L$ is squared error loss and $J(\beta)=\|\beta\|_1$ is the $\ell_1$ norm of $\beta$--the optimal coefficient path is piecewise linear, that is, $\partial \hat{\beta}(\lambda)/\partial \lambda$ is piecewise constant. We derive a general characterization of the properties of (loss $L$, penalty $J$) pairs which give piecewise linear coefficient paths. Such pairs allow for efficient generation of the full regularized coefficient paths. We investigate the nature of efficient path following algorithms which arise. We use our results to suggest robust versions of the LASSO for regression and classification, and to develop new, efficient algorithms for existing problems in the literature, including Mammen and van de Geer's locally adaptive regression splines.