Cross Validation
Is Cross-validation the Gold Standard to Estimate Out-of-sample Model Performance?
Cross-validation (CV) is the default choice for estimating the out-of-sample performance of machine learning models. Despite its wide usage, its statistical benefits have remained half-understood, especially in challenging nonparametric regimes. In this paper we fill this gap and show that, in terms of estimating out-of-sample performance, for a wide spectrum of models, CV does not statistically outperform the simple "plug-in" approach where one reuses training data for testing evaluation. Specifically, in terms of both the asymptotic bias and the coverage accuracy of the associated interval for out-of-sample evaluation, K-fold CV provably cannot outperform plug-in regardless of the rate at which the parametric or nonparametric models converge. Leave-one-out CV can have a smaller bias than plug-in; however, this bias improvement is negligible compared to the variability of the evaluation, and in some important cases leave-one-out again does not outperform plug-in once this variability is taken into account.
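To make the two estimators in this abstract concrete, here is a minimal sketch that computes a plug-in estimate and a K-fold CV estimate of mean squared error on the same data. The ridge model, the synthetic data, and the helper names (`ridge_fit`, `plug_in_mse`, `kfold_mse`) are assumptions chosen for illustration, not the paper's experimental setup.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients: (X'X + lam*I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def plug_in_mse(X, y, lam):
    """Plug-in estimate: fit on all data, then evaluate on that same data."""
    beta = ridge_fit(X, y, lam)
    return float(np.mean((y - X @ beta) ** 2))

def kfold_mse(X, y, lam, k=5, seed=0):
    """K-fold CV estimate: average held-out squared error over k folds."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    sq_errors = np.empty(n)
    for test_idx in np.array_split(rng.permutation(n), k):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        beta = ridge_fit(X[train], y[train], lam)
        sq_errors[test_idx] = (y[test_idx] - X[test_idx] @ beta) ** 2
    return float(np.mean(sq_errors))

# Synthetic linear data with unit noise variance (an illustrative assumption).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + rng.normal(size=n)

plug_in = plug_in_mse(X, y, lam=1.0)
cv = kfold_mse(X, y, lam=1.0)
print(f"plug-in MSE: {plug_in:.3f}, 5-fold CV MSE: {cv:.3f}")
```

A single run like this only shows how the two numbers are obtained; comparing bias or interval coverage as the abstract does would require repeating the experiment over many independent datasets.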
Cross-validation Confidence Intervals for Test Error
This work develops central limit theorems for cross-validation and consistent estimators of the asymptotic variance under weak stability conditions on the learning algorithm. Together, these results provide practical, asymptotically-exact confidence intervals for k-fold test error and valid, powerful hypothesis tests of whether one learning algorithm has smaller k-fold test error than another. These results are also the first of their kind for the popular choice of leave-one-out cross-validation. In our experiments with diverse learning algorithms, the resulting intervals and tests outperform the most popular alternative methods from the literature.
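As a rough illustration of a normal-approximation interval for k-fold test error, the sketch below pools the held-out losses, takes their mean, and uses their sample variance for the standard error. Whether this variance estimator matches the ones analysed in the paper is an assumption; the least-squares learner, the synthetic data, and the function names are likewise hypothetical.

```python
import math
from statistics import NormalDist

import numpy as np

def kfold_losses(X, y, k=10, seed=0):
    """Held-out squared-error loss of every point under k-fold CV (least squares)."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    losses = np.empty(n)
    for test_idx in np.array_split(rng.permutation(n), k):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        losses[test_idx] = (y[test_idx] - X[test_idx] @ beta) ** 2
    return losses

def cv_confidence_interval(losses, level=0.95):
    """Normal-approximation interval: mean +/- z * sd / sqrt(n).

    Assumption: the variance is estimated by the sample variance of the
    pooled held-out losses.
    """
    n = losses.size
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    center = float(losses.mean())
    half_width = z * float(losses.std(ddof=1)) / math.sqrt(n)
    return center - half_width, center + half_width

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=300)

losses = kfold_losses(X, y)
low, high = cv_confidence_interval(losses)
print(f"10-fold CV error: {losses.mean():.3f}, 95% interval: ({low:.3f}, {high:.3f})")
```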
Approximate Cross-Validation for Structured Models
Many modern data analyses benefit from explicitly modeling dependence structure in data -- such as measurements across time or space, ordered words in a sentence, or genes in a genome. A gold standard evaluation technique is structured cross-validation (CV), which leaves out some data subset (such as data within a time interval or data in a geographic region) in each fold. But CV here can be prohibitively slow due to the need to re-run already-expensive learning algorithms many times. Previous work has shown approximate cross-validation (ACV) methods provide a fast and provably accurate alternative in the setting of empirical risk minimization. But this existing ACV work is restricted to simpler models by the assumptions that (i) data across CV folds are independent and (ii) an exact initial model fit is available. In structured data analyses, both these assumptions are often untrue.
Weighted Leave-One-Out Cross Validation
Pronzato, Luc, Rendas, Maria-João
We present a weighted version of Leave-One-Out (LOO) cross-validation for estimating the Integrated Squared Error (ISE) when approximating an unknown function by a predictor that depends linearly on evaluations of the function over a finite collection of sites. The method relies on the construction of the best linear estimator of the squared prediction error at an arbitrary unsampled site based on squared LOO residuals, assuming that the function is a realization of a Gaussian Process (GP). A theoretical analysis of performance of the ISE estimator is presented, and robustness with respect to the choice of the GP kernel is investigated first analytically, then through numerical examples. Overall, the estimation of ISE is significantly more precise than with classical, unweighted, LOO cross validation. Application to model selection is briefly considered through examples.
Stability Regularized Cross-Validation
Cory-Wright, Ryan, Gómez, Andrés
We revisit the problem of ensuring strong test-set performance via cross-validation. Motivated by the generalization theory literature, we propose a nested k-fold cross-validation scheme that selects hyperparameters by minimizing a weighted sum of the usual cross-validation metric and an empirical model-stability measure. The weight on the stability term is itself chosen via a nested cross-validation procedure. This reduces the risk of strong validation set performance and poor test set performance due to instability. We benchmark our procedure on a suite of 13 real-world UCI datasets, and find that, compared to k-fold cross-validation over the same hyperparameters, it improves the out-of-sample MSE for sparse ridge regression and CART by 4% on average, but has no impact on XGBoost. This suggests that for interpretable and unstable models, such as sparse regression and CART, our approach is a viable and computationally affordable method for improving test-set performance.
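The selection rule described here, a weighted sum of a CV metric and a stability measure, can be sketched as follows. The stability proxy used (the mean squared gap between each fold model's predictions and the full-data model's predictions) is a hypothetical stand-in, not the measure from the paper, and the weight is fixed by hand rather than chosen by the nested procedure the abstract describes.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error_and_instability(X, y, lam, k=5, seed=0):
    """Return (k-fold CV MSE, instability proxy) for ridge with penalty lam.

    The instability proxy is a hypothetical choice: the average squared gap
    between fold-model predictions and full-data-model predictions.
    """
    n = X.shape[0]
    full_pred = X @ ridge_fit(X, y, lam)
    rng = np.random.default_rng(seed)
    sq_err = np.empty(n)
    gaps = []
    for test_idx in np.array_split(rng.permutation(n), k):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        beta = ridge_fit(X[train], y[train], lam)
        sq_err[test_idx] = (y[test_idx] - X[test_idx] @ beta) ** 2
        gaps.append(np.mean((X @ beta - full_pred) ** 2))
    return float(sq_err.mean()), float(np.mean(gaps))

def select_lambda(X, y, lambdas, weight):
    """Pick the penalty minimising CV error + weight * instability."""
    scores = []
    for lam in lambdas:
        cv_err, instability = cv_error_and_instability(X, y, lam)
        scores.append(cv_err + weight * instability)
    return lambdas[int(np.argmin(scores))]

# Synthetic data and a small penalty grid (illustrative assumptions).
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 8))
y = X @ rng.normal(size=8) + rng.normal(size=120)
lambdas = [0.01, 1.0, 100.0]

print("plain CV choice:      ", select_lambda(X, y, lambdas, weight=0.0))
print("stability-regularized:", select_lambda(X, y, lambdas, weight=1.0))
```

Setting `weight=0.0` recovers ordinary k-fold selection, which makes the effect of the stability term easy to isolate.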
Can we globally optimize cross-validation loss? Quasiconvexity in ridge regression
Models like LASSO and ridge regression are extensively used in practice due to their interpretability, ease of use, and strong theoretical guarantees. Cross-validation (CV) is widely used for hyperparameter tuning in these models, but do practical methods minimize the true out-of-sample loss? A recent line of research promises to show that the optimum of the CV loss matches the optimum of the out-of-sample loss (possibly after simple corrections). It remains to show how tractable it is to minimize the CV loss. In the present paper, we show that, in the case of ridge regression, the CV loss may fail to be quasiconvex and thus may have multiple local optima. We can guarantee that the CV loss is quasiconvex in at least one case: when the spectrum of the covariate matrix is nearly flat and the noise in the observed responses is not too high.
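One way to probe the shape of the CV loss in practice is to evaluate the leave-one-out loss of ridge regression on a grid of penalties and count interior local minima. The sketch below uses the standard hat-matrix identity for ridge leave-one-out residuals; the synthetic data, the grid, and the helper names are assumptions, and a grid count is only a heuristic that can miss structure between grid points.

```python
import numpy as np

def loo_cv_loss(X, y, lam):
    """Exact leave-one-out MSE for ridge via the hat-matrix identity.

    With H = X (X'X + lam*I)^{-1} X', the i-th leave-one-out residual is
    (y_i - yhat_i) / (1 - H_ii).
    """
    d = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))
    return float(np.mean(loo_resid ** 2))

def count_local_minima(values):
    """Count interior grid points strictly lower than both neighbours."""
    v = np.asarray(values, dtype=float)
    interior = (v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])
    return int(interior.sum())

# Synthetic data; not the construction used in the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = X @ rng.normal(size=10) + rng.normal(size=30)

lambdas = np.logspace(-3, 3, 200)
losses = [loo_cv_loss(X, y, lam) for lam in lambdas]
print("interior local minima on the grid:", count_local_minima(losses))
```

A count above one on a fine grid would be consistent with a non-quasiconvex CV loss, while a count of zero or one does not by itself establish quasiconvexity.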
Review for NeurIPS paper: Cross-validation Confidence Intervals for Test Error
Weaknesses: Some major comments: 1) The connection to algorithmic stability is interesting, but I am not convinced that this can deliver as strong results as we would like beyond what can already be achieved through standard results/analysis. More specifically, algorithmic stability has mostly shown O(1/n) results for ERM or SGD, but this is just a rehashing of standard results, essentially following from iid-ness, that is, that every datapoint contributes the same information on average. This is not a problem with the current paper per se, but more a critique of algorithmic stability analysis. Rather, my concern for the current paper is twofold: a) the connection to algorithmic stability cannot deliver, as far as I understand, any stronger results than what is already possible through standard methods; b) and thus a basic CLT for CV error is attainable through a more standard analysis. Indeed, the path to asymptotic normality is pretty straightforward in the paper, since all important steps are more-or-less assumed: square integrability of the mean loss \bar h_n, strong convexity of the loss function, which guarantees O(1/n) rates, etc. 2) The experimental setup is very confusing to me.
Review for NeurIPS paper: Cross-validation Confidence Intervals for Test Error
The reviewers were all rather positive about the theoretical contribution, although one minority negative review (R1) gave a low score due to an experimental setup deemed unconvincing. Overall I recommend acceptance, possibly asking the authors to make some revisions to the experimental section to address some criticisms of R1.
Review for NeurIPS paper: Approximate Cross-Validation with Low-Rank Data in High Dimensions
Weaknesses: I think the significance of the results (maybe because of the delivery of the result) is below the threshold of acceptance. 1) The first weakness is that there is no discussion about whether the upper bound (mentioned in the strengths) is tight and when this upper bound implies consistency, i.e., the error goes to 0 under a certain limit. Note that the norm of the true signal, the scale of the feature matrix, and the best tuning parameter need to satisfy certain order conditions for the problem to be meaningful. A common approach is to apply PCA and do feature selection first. The authors should then compare their results with prior works on the selected features. After response: I noticed Corollary 1 and Corollary 2. But these two corollaries together only cover the trivial case where the sample size goes to infinity while the rank of the feature matrix is bounded by a constant.
Review for NeurIPS paper: Approximate Cross-Validation with Low-Rank Data in High Dimensions
Two reviewers agree that this submission represents an important contribution to the field. However, a third expressed significant concerns about the tightness of the presented bounds, the accommodation of matrices with growing rank, and behavior in the presence of principal component preprocessing. Please be sure to carefully review and address the concerns of all reviewers in the revision.