 Cross Validation


Subject Cross Validation in Human Activity Recognition

arXiv.org Machine Learning

K-fold Cross Validation is commonly used to evaluate classifiers and to tune their hyperparameters. However, it assumes that data points are independent and identically distributed (i.i.d.), so that the samples used in the training and test sets can be selected randomly and uniformly. In Human Activity Recognition datasets, samples produced by the same subject are likely to be correlated for several reasons. Hence, k-fold cross validation may overestimate the performance of activity recognizers, in particular when overlapping sliding windows are used. In this paper, we investigate the effect of Subject Cross Validation on the performance of Human Activity Recognition, with both non-overlapping and overlapping sliding windows. Results show that k-fold cross validation artificially inflates recognizer performance by about 10%, and by as much as 16% when overlapping windows are used. In addition, we observe no performance gain from the use of overlapping windows. We conclude that Human Activity Recognition systems should be evaluated by Subject Cross Validation, and that overlapping windows are not worth their extra computational cost.
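Subject-wise splitting of this kind is straightforward with scikit-learn's GroupKFold, which keeps every window from a given subject in the same fold. A minimal sketch, with the feature matrix, labels and subject identifiers as illustrative placeholders:

```python
# Sketch of subject cross validation via GroupKFold; X, y and
# subject_ids are placeholders for real windowed sensor data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))               # windowed sensor features
y = rng.integers(0, 5, size=600)             # activity labels
subject_ids = np.repeat(np.arange(10), 60)   # 10 subjects, 60 windows each

# All windows from one subject land in the same fold, so correlated
# (possibly overlapping) windows never straddle the train/test boundary.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=cv, groups=subject_ids)
print(scores.mean())
```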


Cross validation in sparse linear regression with piecewise continuous nonconvex penalties and its acceleration

arXiv.org Machine Learning

We investigate the signal reconstruction performance of sparse linear regression in the presence of noise when piecewise continuous nonconvex penalties are used. Among such penalties, we focus on the smoothly clipped absolute deviation (SCAD) penalty. The contributions of this study are three-fold. First, we present a theoretical analysis of the typical reconstruction performance, using the replica method, under the assumption that each component of the design matrix is an independent and identically distributed (i.i.d.) Gaussian variable. This clarifies the superiority of the SCAD estimator over $\ell_1$ in a wide parameter range, although the nonconvex nature of the penalty tends to lead to solution multiplicity in certain regions. This multiplicity is shown to be connected to replica symmetry breaking in spin-glass theory, and the associated phase diagrams are given. We also show that the global minimum of the mean square error between the estimator and the true signal is located in the replica symmetric phase. Second, we develop an approximate formula that efficiently computes the cross-validation error without actually conducting cross-validation, and that is also applicable to non-i.i.d. design matrices. This formula is only applicable in the unique-solution region and tends to be unstable in the multiple-solution region. We implement instability detection procedures, which allow the approximate formula to stand alone and consequently enable us to draw phase diagrams for any specific dataset. Third, we propose an annealing procedure, called nonconvexity annealing, to obtain the solution path efficiently. Numerical simulations on synthetic datasets verify the consistency of the theoretical results and the efficiency of the approximate formula and of nonconvexity annealing.
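For reference, the SCAD penalty of Fan and Li is defined piecewise; with regularization parameter $\lambda > 0$ and shape parameter $a > 2$ (commonly $a = 3.7$), it behaves like $\ell_1$ near zero and applies no extra shrinkage to large coefficients:

```latex
% SCAD penalty, applied coordinate-wise to a coefficient t
p_\lambda(t) =
\begin{cases}
  \lambda |t|,                                        & |t| \le \lambda, \\
  \dfrac{2 a \lambda |t| - t^2 - \lambda^2}{2(a-1)},  & \lambda < |t| \le a\lambda, \\
  \dfrac{(a+1)\lambda^2}{2},                          & |t| > a\lambda.
\end{cases}
```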


Efficient Cross-Validation for Semi-Supervised Learning

arXiv.org Machine Learning

Manifold regularization, such as Laplacian regularized least squares (LapRLS) and the Laplacian support vector machine (LapSVM), has been widely used in semi-supervised learning, and its performance depends heavily on the choice of several hyper-parameters. Cross-validation (CV) is the most popular approach for selecting the optimal hyper-parameters, but it is computationally expensive because it requires training the learner many times. In this paper, we provide a method to approximate CV for manifold regularization based on a notion from robust statistics, the Bouligand influence function (BIF). We first provide a strategy for approximating CV via a Taylor expansion of the BIF. Then, we show how to calculate the BIF for a general loss function, and give the resulting approximate CV criteria for model selection in manifold regularization. The proposed approximate CV for manifold regularization requires training only once, and hence significantly improves on the efficiency of traditional CV. Experimental results show that our approximate CV exhibits no statistical discrepancy from exact CV while requiring far less computation time.
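The BIF machinery is specific to manifold regularization, but the underlying idea of estimating CV error from a single fit can be illustrated on a simpler model. A minimal sketch, assuming plain ridge regression on placeholder data, where the exact leave-one-out error follows from the hat matrix without any retraining (this is not the authors' BIF-based method, only the same single-fit principle):

```python
# Generic illustration of "CV from a single fit": exact leave-one-out
# error for ridge regression via the hat matrix.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 10, 1.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)  # ridge hat matrix
resid = y - H @ y                                        # in-sample residuals
loo_errors = (resid / (1.0 - np.diag(H))) ** 2           # exact LOO shortcut
print("LOO MSE without retraining:", loo_errors.mean())
```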


Rescaling and other forms of unsupervised preprocessing introduce bias into cross-validation

arXiv.org Machine Learning

Cross-validation of predictive models is the de facto standard for model selection and evaluation. In proper use, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo a preliminary data-dependent transformation, such as feature rescaling or dimensionality reduction, prior to cross-validation. It is widely believed that such a preprocessing stage, if done in an unsupervised manner that does not consider the class labels or response values, has no effect on the validity of cross-validation. In this paper, we show that this belief is not true: preliminary preprocessing can introduce either a positive or a negative bias into the estimates of model performance, and may therefore lead to sub-optimal choices of model parameters and invalid inference. In light of this, the scientific community should re-examine the use of preliminary preprocessing prior to cross-validation across the various application domains. By default, all data transformations, including unsupervised preprocessing stages, should be learned only from the training samples and then merely applied to the validation and testing samples.
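In scikit-learn terms, the recommended default amounts to placing the unsupervised step inside a Pipeline so it is refit on each training fold. A minimal sketch with placeholder data, contrasting the correct workflow with the biased one the paper warns against:

```python
# Fitting the scaler inside each CV fold (correct) versus fitting it
# on the full dataset before splitting (biased). Data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = rng.integers(0, 2, size=200)

# Correct: the scaler sees only the training portion of each fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=5).mean())

# Biased variant: rescaling X with statistics computed from all
# samples before cross-validation leaks information across folds.
X_leaky = StandardScaler().fit_transform(X)
print(cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean())
```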


Dealing with imbalanced data: undersampling, oversampling and proper cross-validation

#artificialintelligence

Inside the cross-validation loop, hold a sample out and do not use it for anything related to feature selection, oversampling or model building. Oversample your minority class without the sample you already excluded. Use the excluded sample for validation, and the oversampled minority class together with the majority class to build the model. Repeat n times, where n is your number of samples (if doing leave-one-participant-out cross-validation).
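A minimal sketch of that procedure, assuming a leave-one-out split and simple resampling with replacement; the data and classifier are placeholders:

```python
# Leave-one-out CV with oversampling applied only to the training
# portion of each split, so the held-out sample never leaks in.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = np.array([0] * 50 + [1] * 10)          # imbalanced binary labels

preds = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Oversample the minority class by resampling with replacement,
    # using training rows only.
    min_rows = np.flatnonzero(y_tr == 1)
    maj_rows = np.flatnonzero(y_tr == 0)
    boost = rng.choice(min_rows, size=len(maj_rows) - len(min_rows))
    keep = np.concatenate([maj_rows, min_rows, boost])
    model = LogisticRegression().fit(X_tr[keep], y_tr[keep])
    preds[test_idx] = model.predict(X[test_idx])

print("LOO accuracy:", (preds == y).mean())
```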


Advanced cross-validation tips for time series

#artificialintelligence

The blue line, suddenly dropping to zero at t = 80, is our modified target. The green line plots our predictions for that modified target, and its divergence from the red line (our predictions for the original target) after t = 80 reveals that we are suffering from feature leakage (without feature leakage, the green and red lines would have diverged only after t = 94).
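A minimal sketch of this leakage probe, assuming lag-14 features over a synthetic series (the cutoff t = 80 and the 14-step horizon mirror the numbers above); with honest lag features the predictions should diverge only at t = 94, while any earlier divergence would flag leakage:

```python
# Leakage probe: zero out the target after a cutoff, rebuild the
# features, and check where the predictions start to differ.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
T, CUTOFF, LAG = 120, 80, 14
y = np.cumsum(rng.normal(size=T))          # original target
y_mod = y.copy()
y_mod[CUTOFF:] = 0.0                       # modified target drops to zero at t = 80

def lag_features(target, lag=LAG):
    # Honest features: only values at least `lag` steps in the past.
    X = np.column_stack([np.roll(target, lag), np.roll(target, lag + 1)])
    X[: lag + 1] = 0.0                     # rows where the roll wrapped around
    return X

X_orig, X_mod = lag_features(y), lag_features(y_mod)

# Train once on the pre-cutoff region, identical for both target versions.
model = LinearRegression().fit(X_orig[LAG + 1 : CUTOFF], y[LAG + 1 : CUTOFF])

diff = np.abs(model.predict(X_orig) - model.predict(X_mod))
# With leak-free lag-14 features this should print 94 (= 80 + 14);
# an earlier divergence would mean future target values leak in.
print("predictions diverge at t =", int(np.argmax(diff > 1e-9)))
```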


On Cross-validation for Sparse Reduced Rank Regression

arXiv.org Machine Learning

In high-dimensional data analysis, regularization methods that pursue sparsity and/or low rank have recently received a great deal of attention. To provide a proper amount of shrinkage, it is typical to use a grid search together with a model comparison criterion to find the optimal regularization parameters. However, we show that fixing the parameters across all folds may result in an inconsistency issue, and that it is more appropriate to cross-validate projection-selection patterns to obtain the best coefficient estimate. Our in-sample error studies in jointly sparse and rank-deficient models lead to a new class of information criteria with four scale-free forms that bypass the estimation of the noise level. Using an identity, we propose a novel scale-free calibration to help cross-validation achieve the minimax optimal error rate non-asymptotically. Experiments support the efficacy of the proposed methods.


Improve Your Model Performance using Cross Validation (in Python / R)

#artificialintelligence

This article was originally published on November 18, 2015 and updated on April 30, 2018. One of the most interesting and challenging things about hackathons is getting a high score on both the public and private leaderboards. Having closely monitored a series of data hackathons, I noticed an interesting trend in participant rankings on the public and private leaderboards: participants who ranked high on the public leaderboard often lost their positions once their models were validated on the private leaderboard.
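Cross-validation is the standard remedy: an average over several held-out folds is a far more reliable guide than a single public-leaderboard split. A minimal k-fold sketch in Python, with the dataset and model standing in for whatever the competition uses:

```python
# Minimal k-fold cross-validation with scikit-learn; data and model
# are placeholders for the actual competition task.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(random_state=0),
                         X, y, cv=cv)
# A stable mean CV score is a better guide than the public leaderboard,
# which behaves like a single small validation split.
print("fold scores:", np.round(scores, 3), "mean:", scores.mean().round(3))
```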


Nested cross-validation when selecting classifiers is overzealous for most practical applications

arXiv.org Machine Learning

When selecting a classification algorithm to be applied to a particular problem, one has to simultaneously select the best algorithm for that dataset and the best set of hyperparameters for the chosen model. The usual approach is to apply a nested cross-validation procedure: hyperparameter selection is performed in the inner cross-validation, while the outer cross-validation computes an unbiased estimate of the expected accuracy of the algorithm with cross-validation based hyperparameter tuning. The alternative approach, which we shall call "flat cross-validation", uses a single cross-validation step both to select the optimal hyperparameter values and to provide an estimate of the expected accuracy of the algorithm, which, while biased, may nevertheless still be used to select the best learning algorithm. We tested both procedures using 12 different algorithms on 115 real-life binary datasets and conclude that using the less computationally expensive flat cross-validation procedure will generally result in the selection of an algorithm that is, for all practical purposes, of similar quality to that selected via nested cross-validation, provided the learning algorithms have relatively few hyperparameters to be optimised. A practitioner who builds a classification model has to select the best algorithm for that particular problem. There are hundreds of classification algorithms described in the literature, such as k-nearest neighbour [1], SVM [2], neural networks [3], naïve Bayes [4], gradient boosting machines [5], and so on.
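The two procedures are easy to contrast in scikit-learn; a minimal sketch with a placeholder dataset and hyperparameter grid:

```python
# Nested versus flat cross-validation for one algorithm and grid.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)

# Nested CV: the outer loop scores the whole "tune then fit" procedure,
# giving an (almost) unbiased accuracy estimate.
nested = cross_val_score(grid, X, y, cv=5).mean()

# Flat CV: the same inner search, with its best (optimistically biased)
# score reused both for tuning and for comparing algorithms.
flat = grid.fit(X, y).best_score_
print(f"nested: {nested:.3f}  flat: {flat:.3f}")
```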


Cross validation residuals for generalised least squares and other correlated data models

arXiv.org Machine Learning

Cross validation residuals are well known for the ordinary least squares model. Here leave-M-out cross validation is extended to generalised least squares. The relationship between cross validation residuals and Cook's distance is demonstrated, in terms of an approximation to the difference in the generalised residual sum of squares for a model fit to all the data (training and test) and a model fit to a reduced dataset (training data only). For generalised least squares, as for ordinary least squares, there is no need to refit the model to reduced size datasets as all the values for K fold cross validation are available after fitting the model to all the data.