 Cross Validation


Cross validation residuals for generalised least squares and other correlated data models

arXiv.org Machine Learning

Cross validation residuals are well known for the ordinary least squares model. Here leave-M-out cross validation is extended to generalised least squares. The relationship between cross validation residuals and Cook's distance is demonstrated, in terms of an approximation to the difference in the generalised residual sum of squares for a model fit to all the data (training and test) and a model fit to a reduced dataset (training data only). For generalised least squares, as for ordinary least squares, there is no need to refit the model to reduced size datasets as all the values for K fold cross validation are available after fitting the model to all the data.
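
As an illustration of the well-known ordinary least squares case only (not the generalised least squares extension the abstract describes), the following minimal sketch computes leave-one-out residuals for a one-predictor linear model from a single fit, using the standard identity e_i / (1 - h_ii), and compares them with residuals obtained by refitting n times. The function names and the toy data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def loo_residuals_shortcut(xs, ys):
    """Leave-one-out residuals from one fit to all data: e_i / (1 - h_ii)."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    out = []
    for x, y in zip(xs, ys):
        h = 1.0 / n + (x - x_bar) ** 2 / sxx  # leverage of this observation
        out.append((y - (a + b * x)) / (1.0 - h))
    return out


def loo_residuals_refit(xs, ys):
    """Leave-one-out residuals by refitting n times (slow reference)."""
    out = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        out.append(ys[i] - (a + b * xs[i]))
    return out


# Toy data, invented for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]
shortcut = loo_residuals_shortcut(xs, ys)
refit = loo_residuals_refit(xs, ys)
```

The two lists agree up to floating-point rounding, which is the sense in which no refitting is needed in the ordinary least squares case.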


Using J-K fold Cross Validation to Reduce Variance When Tuning NLP Models

arXiv.org Machine Learning

K-fold cross validation (CV) is a popular method for estimating the true performance of machine learning models, allowing model selection and parameter tuning. However, the very process of CV requires random partitioning of the data and so our performance estimates are in fact stochastic, with variability that can be substantial for natural language processing tasks. We demonstrate that these unstable estimates cannot be relied upon for effective parameter tuning. The resulting tuned parameters are highly sensitive to how our data is partitioned, meaning that we often select sub-optimal parameter choices and have serious reproducibility issues. Instead, we propose to use the less variable J-K-fold CV, in which J independent K-fold cross validations are used to assess performance. Our main contributions are extending J-K-fold CV from performance estimation to parameter tuning and investigating how to choose J and K. We argue that variability is more important than bias for effective tuning and so advocate lower choices of K than are typically seen in the NLP literature, instead using the saved computation to increase J. To demonstrate the generality of our recommendations we investigate a wide range of case-studies: sentiment classification (both general and target-specific), part-of-speech tagging and document classification.
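
The J-K-fold idea described above can be sketched as follows: run J independent random K-fold partitions and average all J x K fold scores. This is a minimal illustration, not the authors' implementation; `jk_fold_cv`, `k_fold_indices`, and the toy scorer `mean_predictor_mse` are hypothetical names introduced here.

```python
import random


def k_fold_indices(n, k, rng):
    """Randomly partition range(n) into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def jk_fold_cv(data, score_fn, j, k, seed=0):
    """Average score over J independent K-fold partitions of `data`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(j):  # J independent repetitions
        for fold in k_fold_indices(len(data), k, rng):  # K folds each
            held_out = set(fold)
            train = [d for i, d in enumerate(data) if i not in held_out]
            test = [data[i] for i in fold]
            scores.append(score_fn(train, test))
    return sum(scores) / len(scores)


def mean_predictor_mse(train, test):
    """Hypothetical toy scorer: predict the training mean, return test MSE."""
    mu = sum(train) / len(train)
    return sum((y - mu) ** 2 for y in test) / len(test)


# Toy data, invented for illustration.
data = [float(v) for v in range(20)]
estimate = jk_fold_cv(data, mean_predictor_mse, j=5, k=4)
```

For a fixed budget of J x K model fits, lowering K and raising J follows the trade-off the abstract advocates; the fixed seed makes a given estimate reproducible.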


Cross-validation Tutorial: What, how and which?

#artificialintelligence

"Statistics [from cross-validation] are like bikinis. ..." Slides by P. Raamana. Goals for today: What is cross-validation? What is generalizability? One slide contrasts negative bias, unbiased, and positive bias. A bigger training set gives better learning; a bigger test set gives better testing. Key: train and test sets must be disjoint, and the dataset or sample size is fixed.


How to check/optimize cross validation with randomforest on imbalanced classes?

@machinelearnbot

Your data set is a bit small. The classic solution is to over-sample under-represented classes. I've been doing it routinely, but on data sets with 50 million observations, where the class "fraud" (versus "non fraud") represented only 4 out of 10,000 observations. If you can get a much bigger data set, that would help. Also, with such a small yet unbalanced data set, I would use fewer than 5 predictors.
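
A minimal sketch of the over-sampling mentioned above, assuming simple random resampling with replacement until every class matches the majority count. The `oversample` helper and the toy labels are hypothetical. When combined with cross validation, a common precaution (an assumption here, not something the answer states) is to over-sample only the training folds so that duplicated rows do not leak into the test fold.

```python
import random
from collections import Counter


def oversample(rows, labels, seed=0):
    """Resample minority classes with replacement up to the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, lab in zip(rows, labels) if lab == cls]
        for _ in range(target - count):  # zero iterations for the majority
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data, invented for illustration: 8 "non fraud" rows and 2 "fraud" rows.
rows = [[i] for i in range(10)]
labels = ["non fraud"] * 8 + ["fraud"] * 2
balanced_rows, balanced_labels = oversample(rows, labels)
```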


Optimizing for Generalization in Machine Learning with Cross-Validation Gradients

arXiv.org Machine Learning

Cross-validation is the workhorse of modern applied statistics and machine learning, as it provides a principled framework for selecting the model that maximizes generalization performance. In this paper, we show that the cross-validation risk is differentiable with respect to the hyperparameters and training data for many common machine learning algorithms, including logistic regression, elastic-net regression, and support vector machines. Leveraging this property of differentiability, we propose a cross-validation gradient method (CVGM) for hyperparameter optimization. Our method enables efficient optimization in high-dimensional hyperparameter spaces of the cross-validation risk, the best surrogate of the true generalization ability of our learning algorithm.


Cross-validation in high-dimensional spaces: a lifeline for least-squares models and multi-class LDA

arXiv.org Machine Learning

Least-squares models such as linear regression and Linear Discriminant Analysis (LDA) are amongst the most popular statistical learning techniques. However, since their computation time increases cubically with the number of features, they are inefficient in high-dimensional neuroimaging datasets. Fortunately, for k-fold cross-validation, an analytical approach has been developed that yields the exact cross-validated predictions in least-squares models without explicitly training the model. Its computation time grows with the number of test samples. Here, this approach is systematically investigated in the context of cross-validation and permutation testing. LDA is used as an example, but the results hold for all other least-squares methods. Furthermore, a non-trivial extension to multi-class LDA is formally derived. The analytical approach is evaluated using complexity calculations, simulations, and permutation testing of an EEG/MEG dataset. Depending on the ratio between features and samples, the analytical approach is up to 10,000x faster than the standard approach (retraining the model on each training set). This allows for a fast cross-validation of least-squares models and multi-class LDA in high-dimensional data, with obvious applications in multi-dimensional datasets, Representational Similarity Analysis, and permutation testing.


[D] Cross Validation and t-SNE: How to combine all models for visualization? • r/MachineLearning

@machinelearnbot

I have a classifier that was trained with 10-fold cross validation. I have been averaging the predictions of the 10 models to calculate things like test-set accuracy. I wanted to visualize the feature vector of my classifier using t-SNE. What is the best way to combine all 10 models? I tried concatenating all of my normalized feature vectors together across all 10 models, but t-SNE can just pick out all 10 of them and separate them that way.


Cross-Validation for Predictive Analytics Using R - MilanoR

#artificialintelligence

Since ancient times, humankind has avidly sought a way to predict the future. One of the most widely known examples of this kind of activity in the past is the Oracle of Delphi, who dispensed previews of the future to her petitioners in the form of divinely inspired prophecies. In modern days, the desire to know the future is still of interest to many of us, even if my feeling is that the increasing rapidity of the technology innovations we observe every day has somewhat lessened this instinct: things that a few years ago seemed futuristic are now available to the great mass (e.g. the World Wide Web). Among the many areas of human activity where predictions are highly needed is business decision making. The tools for formulating predictions about quantities of interest are commonly known as predictive analytics, which is itself an essential part of data science.


Training Sets, Test Sets, and 10-fold Cross-validation

@machinelearnbot

Editor's note: This is an excerpt from Ron Zacharski's freely available online book titled A Programmer's Guide to Data Mining: The Ancient Art of the Numerati. At the end of the previous chapter we worked with three different datasets: the women athlete dataset, the iris dataset, and the auto miles-per-gallon one. We divided each of these datasets in turn into two subsets. One subset we used to construct the classifier. This data set is called the training set.


Cross-Validation with Confidence

arXiv.org Machine Learning

Cross-validation is one of the most popular model selection methods in statistics and machine learning. Despite its wide applicability, traditional cross-validation methods tend to select overfitting models because they ignore the uncertainty in the testing sample. We develop a new, statistically principled inference tool based on cross-validation that takes into account the uncertainty in the testing sample. This new method outputs a set of highly competitive candidate models that contains the best one with guaranteed probability. As a consequence, our method can achieve consistent variable selection in a classical linear regression setting, for which existing cross-validation methods require unconventional split ratios. When used for regularizing tuning parameter selection, the method can provide a further trade-off between prediction accuracy and model interpretability. We demonstrate the performance of the proposed method in several simulated and real data examples.