Cross Validation


Approximate cross-validation formula for Bayesian linear regression

arXiv.org Machine Learning

Cross-validation (CV) is a technique for evaluating the predictive ability of statistical models/learning systems on the basis of a given data set. Despite its wide applicability, its rather heavy computational cost can prevent its use as the system size grows. To resolve this difficulty in the case of Bayesian linear regression, we develop a formula for evaluating the leave-one-out CV error approximately without actually performing CV. The usefulness of the developed formula is tested by a statistical-mechanical analysis of a synthetic model and confirmed by application to a real-world supernova data set.
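
The paper's specific approximation is not reproduced here, but the idea of estimating the leave-one-out CV error from a single fit can be illustrated with the classical shortcut for the MAP (ridge) estimator of Bayesian linear regression, which avoids the n refits of brute-force LOO. The sketch below uses synthetic data and NumPy only; the prior precision lam and the data sizes are illustrative assumptions, not values from the paper.

    # Sketch: leave-one-out CV error for the MAP (ridge) estimator of Bayesian linear
    # regression without refitting. This is the classical hat-matrix shortcut, not
    # necessarily the formula derived in the paper. Data are synthetic.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam = 200, 20, 1.0              # lam stands in for the Gaussian prior precision
    X = rng.normal(size=(n, p))
    y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

    A = X.T @ X + lam * np.eye(p)         # full-data fit
    beta = np.linalg.solve(A, X.T @ y)
    H = X @ np.linalg.solve(A, X.T)       # hat matrix
    resid = y - X @ beta

    # LOO residuals in one pass (exact for the MAP estimator with fixed lam).
    loo_fast = np.mean((resid / (1.0 - np.diag(H))) ** 2)

    # Brute-force LOO for comparison: refit n times.
    errs = []
    for i in range(n):
        m = np.arange(n) != i
        Ai = X[m].T @ X[m] + lam * np.eye(p)
        bi = np.linalg.solve(Ai, X[m].T @ y[m])
        errs.append((y[i] - X[i] @ bi) ** 2)
    print(loo_fast, np.mean(errs))        # the two estimates agree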



Risk-consistency of cross-validation with lasso-type procedures

arXiv.org Machine Learning

The lasso and related sparsity inducing algorithms have been the target of substantial theoretical and applied research. Correspondingly, many results are known about their behavior for a fixed or optimally chosen tuning parameter specified up to unknown constants. In practice, however, this oracle tuning parameter is inaccessible so one must use the data to select one. Common statistical practice is to use a variant of cross-validation for this task. However, little is known about the theoretical properties of the resulting predictions with such data-dependent methods. We consider the high-dimensional setting with random design wherein the number of predictors $p$ grows with the number of observations $n$. Under typical assumptions on the data generating process, similar to those in the literature, we recover oracle rates up to a log factor when choosing the tuning parameter with cross-validation. Under weaker conditions, when the true model is not necessarily linear, we show that the lasso remains risk consistent relative to its linear oracle. We also generalize these results to the group lasso and square-root lasso and investigate the predictive and model selection performance of cross-validation via simulation.
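
As a concrete illustration of the practice the paper analyzes (choosing the lasso tuning parameter by cross-validation rather than relying on an inaccessible oracle value), the sketch below uses scikit-learn's LassoCV on synthetic data; the sample sizes, the 10-fold scheme, and the data-generating settings are assumptions for illustration, not taken from the paper.

    # Hedged sketch: lasso with a cross-validated tuning parameter, the procedure whose
    # risk properties the paper studies. Synthetic high-dimensional data (p > n).
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=300, n_features=500, n_informative=10,
                           noise=5.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = LassoCV(cv=10).fit(X_tr, y_tr)            # tuning parameter chosen by CV
    print("CV-chosen penalty:", model.alpha_)
    print("nonzero coefficients:", np.sum(model.coef_ != 0))
    print("held-out R^2:", model.score(X_te, y_te))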


Cross-validation in R: a do-it-yourself and a black box approach

@machinelearnbot

In my previous post, we saw that R-squared can lead to a misleading interpretation of the quality of our regression fit in terms of prediction power. One thing that R-squared offers no protection against is overfitting. Cross-validation, on the other hand, by allowing us to have cases in our testing set that are different from the cases in our training set, inherently offers protection against overfitting. In leave-one-out cross-validation, one case in our data set is used as the test set, while the remaining cases are used as the training set. We iterate through the data set until every case has served as the test set.
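
The post itself works in R; the sketch below is a Python analogue of the same idea, contrasting a do-it-yourself leave-one-out loop with the "black box" route via scikit-learn. The data and the choice of a plain linear model are illustrative assumptions.

    # Python analogue of the post's R example: leave-one-out CV done by hand and via a
    # black-box helper, on synthetic data with an ordinary linear regression.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = make_regression(n_samples=60, n_features=5, noise=10.0, random_state=1)

    # Do-it-yourself: each case serves as the test set exactly once.
    sq_errs = []
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        fit = LinearRegression().fit(X[train], y[train])
        sq_errs.append((y[i] - fit.predict(X[i:i + 1])[0]) ** 2)
    print("DIY LOO MSE:", np.mean(sq_errs))

    # Black box: the same computation through scikit-learn.
    scores = cross_val_score(LinearRegression(), X, y,
                             cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    print("black-box LOO MSE:", -scores.mean())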


Bootstrap and cross-validation for evaluating modelling strategies

#artificialintelligence

I've been re-reading Frank Harrell's Regression Modelling Strategies, a must-read for anyone who ever fits a regression model. Be prepared, though: depending on your background, you might get 30 pages in and suddenly become convinced you've been doing nearly everything wrong, which can be disturbing. I wanted to evaluate three simple modelling strategies for dealing with data with many variables. Using data with 54 variables on 1,785 area units from New Zealand's 2013 census, I'm looking to predict median income on the basis of the other 53 variables. The features are all continuous and are variables like "mean number of bedrooms", "proportion of individuals with no religion" and "proportion of individuals who are smokers". None of these is exactly what I would use for real, but they serve the purpose of setting up a competition of strategies that I can test with a variety of model validation techniques.
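
The census data and the post's exact three strategies are not reproduced here; as a hedged sketch of the kind of comparison described, the code below pits three simple strategies against each other with repeated k-fold cross-validation on synthetic data of the same shape (1,785 rows, 53 continuous predictors). The strategies, settings, and data are illustrative assumptions.

    # Hedged sketch: comparing modelling strategies with a resampling-based validation
    # scheme. Synthetic data stand in for the New Zealand census data; the three
    # strategies are illustrative, not the post's own.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression, LassoCV
    from sklearn.model_selection import RepeatedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=1785, n_features=53, n_informative=15,
                           noise=20.0, random_state=0)

    strategies = {
        "all 53 predictors, OLS": LinearRegression(),
        "lasso, penalty chosen by CV": make_pipeline(StandardScaler(), LassoCV(cv=5)),
        "PCA to 10 components, then OLS": make_pipeline(StandardScaler(),
                                                        PCA(n_components=10),
                                                        LinearRegression()),
    }

    cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
    for name, model in strategies.items():
        scores = cross_val_score(model, X, y, cv=cv,
                                 scoring="neg_root_mean_squared_error")
        print(f"{name}: RMSE {-scores.mean():.1f} (+/- {scores.std():.1f})")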


Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models

arXiv.org Machine Learning

The future predictive performance of a Bayesian model can be estimated using Bayesian cross-validation. In this article, we consider Gaussian latent variable models where the integration over the latent values is approximated using the Laplace method or expectation propagation (EP). We study the properties of several Bayesian leave-one-out (LOO) cross-validation approximations that in most cases can be computed with a small additional cost after forming the posterior approximation given the full data. Our main objective is to assess the accuracy of the approximative LOO cross-validation estimators. That is, for each method (Laplace and EP) we compare the approximate fast computation with the exact brute force LOO computation. Secondarily, we evaluate the accuracy of the Laplace and EP approximations themselves against a ground truth established through extensive Markov chain Monte Carlo simulation. Our empirical results show that the approach based upon a Gaussian approximation to the LOO marginal distribution (the so-called cavity distribution) gives the most accurate and reliable results among the fast methods.
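
The Laplace and EP machinery studied in the paper is not reproduced here. As a simplified illustration of the fast-versus-brute-force comparison, the sketch below uses the analytically tractable special case of GP regression with a Gaussian likelihood, where the LOO predictive means and variances follow directly from the inverse of the kernel matrix ($\mu_i = y_i - [K^{-1}y]_i / [K^{-1}]_{ii}$, $\sigma_i^2 = 1/[K^{-1}]_{ii}$); the kernel, noise level, and data are illustrative assumptions.

    # Fast analytic LOO vs. brute-force LOO for GP regression with Gaussian noise, as a
    # stand-in for the paper's Laplace/EP setting. Synthetic 1-D data, RBF kernel.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n, noise = 80, 0.1
    x = np.sort(rng.uniform(-3, 3, n))
    y = np.sin(x) + rng.normal(scale=noise, size=n)

    def kernel(a, b, ell=1.0, sf=1.0):
        return sf ** 2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

    K = kernel(x, x) + noise ** 2 * np.eye(n)

    # Fast LOO from a single matrix inverse.
    Kinv = np.linalg.inv(K)
    loo_var = 1.0 / np.diag(Kinv)
    loo_mu = y - (Kinv @ y) * loo_var
    fast_lpd = norm.logpdf(y, loo_mu, np.sqrt(loo_var)).sum()

    # Brute-force LOO: refit the GP n times.
    brute_lpd = 0.0
    for i in range(n):
        m = np.arange(n) != i
        Km = kernel(x[m], x[m]) + noise ** 2 * np.eye(n - 1)
        ks = kernel(x[i:i + 1], x[m])[0]
        mu = ks @ np.linalg.solve(Km, y[m])
        var = noise ** 2 + kernel(x[i:i + 1], x[i:i + 1])[0, 0] - ks @ np.linalg.solve(Km, ks)
        brute_lpd += norm.logpdf(y[i], mu, np.sqrt(var))
    print(fast_lpd, brute_lpd)   # the two log predictive densities agree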


How to test classifier better than chance using k-fold cross-validation? • /r/MachineLearning

@machinelearnbot

I have 400 units and 10 groups, and I'm classifying the units' group membership using a discriminant function analysis or linear discriminant analysis. During cross-validation, I want to test whether my solution is doing a better job of classifying them than chance (10%). I can get an error rate, but I don't know how to compare it statistically against the chance rate. With the hold-out approach, I can test it using Press' Q statistic or the Maximum Chance Criterion, but with k-fold cross-validation I don't think I can use this approach.
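
One common, approximate answer (not taken from the thread itself) is to pool the held-out predictions from all k folds and compare the number classified correctly against the 10% chance rate with a one-sided binomial test; treating the 400 pooled predictions as independent Bernoulli trials is only an approximation, since the folds share training data. The count of correct classifications below is a hypothetical placeholder.

    # Approximate test of "better than chance" after k-fold CV: pool the held-out
    # predictions and run a one-sided binomial test against the 10% chance rate.
    # Independence of the pooled predictions is an approximation.
    from scipy.stats import binomtest

    n_units = 400        # held-out predictions pooled over the k folds
    n_correct = 160      # hypothetical count of correct classifications (40% accuracy)
    chance = 0.10        # 10 equally likely groups

    result = binomtest(n_correct, n_units, p=chance, alternative="greater")
    print("observed accuracy:", n_correct / n_units)
    print("p-value vs. chance:", result.pvalue)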


A note on adjusting $R^2$ for using with cross-validation

arXiv.org Machine Learning

We show how to adjust the coefficient of determination ($R^2$) when used for measuring predictive accuracy via leave-one-out cross-validation.
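
The note's specific adjustment is not reproduced here. For context, the quantity at issue is the cross-validated (predictive) version of $R^2$, commonly defined from the leave-one-out predictions as

    $R^2_{\mathrm{LOO}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_{-i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$

where $\hat{y}_{-i}$ is the prediction for observation $i$ from a model fit with that observation left out and $\bar{y}$ is the sample mean of the responses.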



How do you know if your model is going to work? Part 4: Cross-validation techniques

#artificialintelligence

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it's better than the models that you rejected? In this concluding Part 4 of our four-part mini-series "How do you know if your model is going to work?" we demonstrate cross-validation techniques, which attempt to improve statistical efficiency by repeatedly splitting the data into training and test sets and re-performing model fitting and model evaluation.
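
As a minimal sketch of the splitting scheme described above (not the article's own code), the snippet below runs 5-fold cross-validation by hand with scikit-learn's KFold; the data and the logistic-regression model are illustrative assumptions.

    # Minimal k-fold cross-validation loop: each fold serves once as the test set while
    # the model is refit on the remaining folds. Synthetic classification data.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    accs = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        accs.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    print("per-fold accuracy:", np.round(accs, 3))
    print("cross-validated accuracy:", np.mean(accs))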