 Cross Validation


Fast calculation of Gaussian Process multiple-fold cross-validation residuals and their covariances

arXiv.org Machine Learning

We generalize fast Gaussian process leave-one-out formulae to multiple-fold cross-validation, highlighting in broad settings the covariance structure of cross-validation residuals. The employed approach, which relies on block matrix inversion via Schur complements, is applied to both Simple and Universal Kriging frameworks. We illustrate how the resulting covariances affect model diagnostics and how to properly transform residuals in the first place. Beyond that, we examine how accounting for dependency between such residuals affects cross-validation-based estimation of the scale parameter. It is found in two distinct cases, namely in scale estimation and in broader covariance parameter estimation via pseudo-likelihood, that correcting for covariances between cross-validation residuals leads back to maximum likelihood estimation or to an original variation thereof. The proposed fast calculation of Gaussian Process multiple-fold cross-validation residuals is implemented and benchmarked against a naive implementation, all in the R language. Numerical experiments highlight the accuracy of our approach as well as the substantial speed-ups that it enables. It is noticeable, however, as supported by a discussion of the main drivers of computational costs and by a dedicated numerical benchmark, that speed-ups steeply decline as the number of folds (say, all sharing the same size) decreases. Overall, our results enable fast multiple-fold cross-validation, have direct consequences for GP model diagnostics, and pave the way for future work on hyperparameter fitting as well as on the promising field of goal-oriented fold design.
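
In the classical leave-one-out special case that the paper generalizes, the residuals and their variances can be read off a single inverse of the covariance matrix: for zero-mean Simple Kriging, the i-th leave-one-out residual is [K^{-1} y]_i / [K^{-1}]_{ii} with variance 1 / [K^{-1}]_{ii}. A minimal Python/NumPy sketch of that special case (the kernel, synthetic data, and jitter below are illustrative assumptions, not the paper's setup or its R implementation):

```python
import numpy as np

def sq_exp_kernel(x, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance matrix for 1-D inputs x."""
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(0, 10, n))
K = sq_exp_kernel(x) + 1e-6 * np.eye(n)          # small jitter for numerical stability
y = rng.multivariate_normal(np.zeros(n), K)

# Fast leave-one-out: e_i = [K^{-1} y]_i / [K^{-1}]_{ii},  var_i = 1 / [K^{-1}]_{ii}
K_inv = np.linalg.inv(K)
loo_residuals = (K_inv @ y) / np.diag(K_inv)
loo_variances = 1.0 / np.diag(K_inv)

# Naive check for one point: refit without observation i and predict it
i = 7
mask = np.arange(n) != i
pred_i = K[i, mask] @ np.linalg.solve(K[np.ix_(mask, mask)], y[mask])
print(y[i] - pred_i, loo_residuals[i])           # the two residuals agree
```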


Cross-Validation and Uncertainty Determination for Randomized Neural Networks with Applications to Mobile Sensors

arXiv.org Machine Learning

Randomized artificial neural networks such as extreme learning machines provide an attractive and efficient method for supervised learning under limited computing resources and for green machine learning. This especially applies when equipping mobile devices (sensors) with weak artificial intelligence. Results on supervised learning with such networks and regression methods are discussed in terms of consistency and bounds for the generalization and prediction error. In particular, some recent results are reviewed that address learning with data sampled by moving sensors, which leads to non-stationary and dependent samples. As randomized networks lead to random out-of-sample performance measures, we study a cross-validation approach to handle the randomness and make use of it to improve out-of-sample performance. Additionally, a computationally efficient approach based on two-stage estimation is discussed for determining the resulting uncertainty in terms of a confidence interval for the mean out-of-sample prediction error. The approach is applied to a prediction problem arising in vehicle-integrated photovoltaics.
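
As a rough illustration of handling the randomness of such networks with cross-validation, the sketch below scores several random hidden-layer draws of an extreme-learning-machine-style regressor by 5-fold CV and keeps the best draw. All names, the toy data, and the selection rule are illustrative assumptions and not the authors' procedure (which additionally covers uncertainty quantification via two-stage estimation):

```python
import numpy as np
from sklearn.model_selection import KFold

def elm_fit(X, y, W, b, ridge=1e-2):
    """Fit output weights of a random-feature (ELM-style) regressor by ridge regression."""
    H = np.tanh(X @ W + b)                      # random hidden features
    return np.linalg.solve(H.T @ H + ridge * np.eye(H.shape[1]), H.T @ y)

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)

n_hidden, cv = 50, KFold(n_splits=5, shuffle=True, random_state=0)
best = None
for seed in range(10):                          # candidate random hidden layers
    r = np.random.default_rng(seed)
    W, b = r.standard_normal((1, n_hidden)), r.standard_normal(n_hidden)
    mse = []
    for tr, te in cv.split(X):
        beta = elm_fit(X[tr], y[tr], W, b)
        mse.append(np.mean((y[te] - elm_predict(X[te], W, b, beta)) ** 2))
    if best is None or np.mean(mse) < best[0]:
        best = (np.mean(mse), seed)
print("best random draw:", best)
```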


Large Non-Stationary Noisy Covariance Matrices: A Cross-Validation Approach

arXiv.org Machine Learning

We introduce a novel covariance estimator that exploits the heteroscedastic nature of financial time series by employing exponentially weighted moving averages and shrinking the in-sample eigenvalues through cross-validation. Our estimator is model-agnostic in that we make no assumptions on the distribution of the random entries of the matrix or on the structure of the covariance matrix. Additionally, we show how Random Matrix Theory can provide guidance for automatic tuning of the hyperparameter that characterizes the time scale of the dynamics of the estimator. By attenuating the noise from both the cross-sectional and time-series dimensions, we empirically demonstrate the superiority of our estimator over competing estimators based on exponentially-weighted and uniformly-weighted covariance matrices.
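
A minimal Python/NumPy sketch of the exponentially weighted ingredient (the half-life, synthetic returns, and the crude eigenvalue shrinkage at the end are placeholders; the paper tunes the time scale via Random Matrix Theory and shrinks eigenvalues through cross-validation):

```python
import numpy as np

def ewma_covariance(returns, halflife=60):
    """Exponentially weighted sample covariance of a (T, N) return matrix."""
    T, _ = returns.shape
    decay = 0.5 ** (1.0 / halflife)
    w = decay ** np.arange(T - 1, -1, -1)       # most recent observation gets the largest weight
    w = w / w.sum()
    X = returns - returns.mean(axis=0)
    return (X * w[:, None]).T @ X

rng = np.random.default_rng(1)
returns = 0.01 * rng.standard_normal((500, 20))  # placeholder for real return data
S = ewma_covariance(returns)

# Crude eigenvalue shrinkage toward the average eigenvalue (placeholder, not the paper's CV rule)
eigval, eigvec = np.linalg.eigh(S)
shrunk = 0.5 * eigval + 0.5 * eigval.mean()
S_shrunk = eigvec @ np.diag(shrunk) @ eigvec.T
```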


Proper Model Selection through Cross Validation

#artificialintelligence

So, what is cross validation? Recall my post about model selection, where we saw that it may be necessary to split the data into three different portions: one for training, one for validation (to choose among models), and a final portion to measure the true accuracy. This procedure is one viable way to choose the best among several models. Cross validation (CV) is not too different from this idea, but it handles training and validation in quite a smart way. For CV we use a larger, combined training-and-validation data set, followed by a separate testing dataset.
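
In scikit-learn terms, this amounts to holding out a test set once and letting cross-validation play the role of the validation split when choosing among models; the dataset and candidate models below are placeholders for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out a test set once; it is only touched after model selection.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-validation on the combined training/validation data chooses among candidate models.
candidates = {"ridge": Ridge(alpha=1.0), "lasso": Lasso(alpha=0.1)}
scores = {name: cross_val_score(m, X_train, y_train, cv=5).mean() for name, m in candidates.items()}
best_name = max(scores, key=scores.get)

# Only the selected model is refit on all training data and evaluated on the untouched test set.
best = candidates[best_name].fit(X_train, y_train)
print(best_name, scores[best_name], best.score(X_test, y_test))
```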


Optimizing Approximate Leave-one-out Cross-validation to Tune Hyperparameters

arXiv.org Machine Learning

For a large class of regularized models, leave-one-out cross-validation can be efficiently estimated with an approximate leave-one-out formula (ALO). We consider the problem of adjusting hyperparameters so as to optimize ALO. We derive efficient formulas to compute the gradient and Hessian of ALO and show how to apply a second-order optimizer to find hyperparameters. We demonstrate the usefulness of the proposed approach by finding hyperparameters for regularized logistic regression and ridge regression on various real-world data sets.
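
For ridge regression in particular, exact leave-one-out residuals are available from a single fit via the leverages h_ii of the hat matrix, e_i = (y_i - yhat_i) / (1 - h_ii); the ALO formulas of the paper extend this kind of shortcut to a broader class of regularized models and make it differentiable in the hyperparameters. A minimal NumPy sketch of the ridge special case, with a simple grid scan standing in for the paper's second-order optimizer (names and data are illustrative):

```python
import numpy as np

def ridge_loo_mse(X, y, lam):
    """Exact leave-one-out MSE for ridge regression from a single fit."""
    n, p = X.shape
    G = X.T @ X + lam * np.eye(p)
    beta = np.linalg.solve(G, X.T @ y)
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(G), X)   # leverages h_ii
    loo_residuals = (y - X @ beta) / (1.0 - h)
    return np.mean(loo_residuals ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(200)

# Scan the regularization strength and keep the LOO-optimal value
lams = np.logspace(-3, 2, 30)
best_lam = min(lams, key=lambda lam: ridge_loo_mse(X, y, lam))
print(best_lam)
```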


Cross-validation Confidence Intervals for Test Error

arXiv.org Machine Learning

This work develops central limit theorems for cross-validation and consistent estimators of its asymptotic variance under weak stability conditions on the learning algorithm. Together, these results provide practical, asymptotically-exact confidence intervals for $k$-fold test error and valid, powerful hypothesis tests of whether one learning algorithm has smaller $k$-fold test error than another. These results are also the first of their kind for the popular choice of leave-one-out cross-validation. In our real-data experiments with diverse learning algorithms, the resulting intervals and tests outperform the most popular alternative methods from the literature.
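
For contrast, the most common baseline simply treats the k per-fold error estimates as independent and applies a t-interval, ignoring the dependence between folds that the paper's theory accounts for. A sketch of that naive baseline (dataset and model are placeholders; this is not the estimator proposed in the paper):

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)
k, fold_mse = 10, []
for tr, te in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[tr], y[tr])
    fold_mse.append(np.mean((y[te] - model.predict(X[te])) ** 2))

fold_mse = np.array(fold_mse)
mean, se = fold_mse.mean(), fold_mse.std(ddof=1) / np.sqrt(k)
half_width = stats.t.ppf(0.975, df=k - 1) * se     # naive t-interval half-width
print(f"k-fold test error: {mean:.1f} +/- {half_width:.1f}")
```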


Cross Validation Machine Learning: K-Fold

#artificialintelligence

Cross-validation is used to evaluate machine learning models on a limited data sample. It estimates the skill of a machine learning model on unseen data. The technique creates and validates a given model multiple times. There are several common types of cross-validation, such as stratified, leave-one-out (LOOCV), and K-fold. Here, we will study the K-fold technique. Let's split the data 70:30, train the model, and test it on the held-out portion of the dataset to get an accuracy figure.
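
A minimal scikit-learn sketch contrasting the single 70:30 split mentioned above with a 5-fold estimate (the dataset and model are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Single 70:30 hold-out split: one accuracy number, sensitive to how the split falls
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_acc = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).score(X_te, y_te)

# 5-fold cross-validation: five accuracy numbers averaged over rotating test folds
cv_acc = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print(holdout_acc, cv_acc.mean())
```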


Cross-validation and hyperparameter tuning

#artificialintelligence

Almost every machine learning algorithm comes with a large number of settings that we, the machine learning researchers and practitioners, need to specify. These tuning knobs, the so-called hyperparameters, help us control the behavior of machine learning algorithms when optimizing for performance, finding the right balance between bias and variance. Hyperparameter tuning for performance optimization is an art in itself, and there are no hard-and-fast rules that guarantee best performance on a given dataset. In Part I and Part II, we saw different holdout and bootstrap techniques for estimating the generalization performance of a model. We learned about the bias-variance trade-off, and we computed the uncertainty of our estimates.
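
As a concrete illustration of tuning hyperparameters with cross-validation, the following scikit-learn sketch scores every candidate setting by 5-fold CV; the model and grid are placeholders, not a recommendation from the article:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# Each candidate (C, epsilon) pair is scored by 5-fold cross-validation
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "epsilon": [0.01, 0.1, 1.0]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```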


When to Impute? Imputation before and during cross-validation

arXiv.org Machine Learning

Cross-validation (CV) is a technique used to estimate generalization error for prediction models. For pipeline modeling algorithms (i.e. modeling procedures with multiple steps), it has been recommended that the entire sequence of steps be carried out during each replicate of CV to mimic the application of the entire pipeline to an external testing set. While theoretically sound, following this recommendation can lead to high computational costs when a pipeline modeling algorithm includes computationally expensive operations, e.g. imputation of missing values. There is a general belief that unsupervised variable selection (i.e. ignoring the outcome) can be applied before conducting CV without incurring bias, but there is less consensus for unsupervised imputation of missing values. We empirically assessed whether conducting unsupervised imputation prior to CV would result in biased estimates of generalization error or in poorly selected tuning parameters, and thus degrade the external performance of downstream models. Results show that, despite an optimistic bias, the reduced variance of imputation before CV (compared to imputation during each replicate of CV) leads to a lower overall root mean squared error for estimating the true external R-squared; moreover, the performance of models tuned using CV with imputation before versus during each replication differs only minimally. In conclusion, unsupervised imputation before CV appears valid in certain settings and may be a helpful strategy that enables analysts to use more flexible imputation techniques without incurring high computational costs.
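
In scikit-learn terms, the two strategies compared in the study look roughly like the sketch below, with mean imputation and a placeholder dataset standing in for the imputation methods and data actually used:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan     # inject 10% missing values

# Imputation *during* CV: the imputer is refit on each training fold inside the pipeline
pipeline = make_pipeline(SimpleImputer(strategy="mean"), Ridge(alpha=1.0))
during = cross_val_score(pipeline, X_missing, y, cv=5).mean()

# Imputation *before* CV: impute once on all data, then cross-validate the model alone
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)
before = cross_val_score(Ridge(alpha=1.0), X_imputed, y, cv=5).mean()
print(during, before)
```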


Machine Learning: Some notes about Cross-Validation

#artificialintelligence

K-fold cross-validation is one of the most used cross-validation methods. In this method, k represents the number of experiments (or folds) that I want to run in order to train and test my data. For example, suppose that we want to run 5 experiments with a dataset of 1000 records. During the first experiment, we test (validate) on the first 200 records and then train on the remaining 800 records. When the first experiment is finished, I obtain a certain accuracy.
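
The 1000-record, 5-fold example above corresponds directly to the following sketch (synthetic placeholder data and a simple linear model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.standard_normal(1000)

scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # Each experiment: 200 records held out for testing, 800 used for training
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))   # R^2 per fold
print(scores, np.mean(scores))
```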