cv fold
Joulani
Cross-validation (CV) is one of the main tools for performance estimation and parameter tuning in machine learning. The general recipe for computing a CV estimate is to run a learning algorithm separately for each CV fold, a computationally expensive process. In this paper, we propose a new approach to reduce the computational burden of CV-based performance estimation. As opposed to all previous attempts, which are specific to a particular learning model or problem domain, we propose a general method applicable to a large class of incremental learning algorithms, which are uniquely fitted to big data problems. In particular, our method applies to a wide range of supervised and unsupervised learning tasks with different performance criteria, as long as the base learning algorithm is incremental. We show that the running time of the algorithm scales logarithmically, rather than linearly, in the number of CV folds. Furthermore, the algorithm has favorable properties for parallel and distributed implementation. Experiments with state-of-the-art incremental learning algorithms confirm the practicality of the proposed method.
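The abstract does not spell out the algorithm, so the following is only a minimal sketch of one divide-and-conquer scheme that yields logarithmic scaling in the number of folds for an incremental learner. It is not presented as the authors' exact method. The `RunningMean` learner, the squared-error loss, and the function names are all hypothetical and chosen for illustration. The idea is that a model already trained on every chunk outside a range of folds can be shared by all folds in that range, so each chunk is fed to a model about log2(k) times instead of k - 1 times.

```python
from copy import deepcopy


class RunningMean:
    """Toy incremental learner (assumed for illustration): predicts the mean seen so far."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, chunk):
        # Incremental update: absorb one chunk of targets.
        self.n += len(chunk)
        self.total += sum(chunk)

    def loss(self, chunk):
        # Mean squared error of the current prediction on a held-out chunk.
        pred = self.total / self.n if self.n else 0.0
        return sum((y - pred) ** 2 for y in chunk) / len(chunk)


def tree_cv(chunks, model, lo, hi, losses):
    """Fill losses[i] for lo <= i <= hi.

    Invariant: `model` has been updated on every chunk outside [lo, hi].
    """
    if lo == hi:
        losses[lo] = model.loss(chunks[lo])
        return
    mid = (lo + hi) // 2
    # Folds in the left half may train on the right half, and vice versa.
    left_model = deepcopy(model)
    for i in range(mid + 1, hi + 1):
        left_model.update(chunks[i])
    tree_cv(chunks, left_model, lo, mid, losses)
    right_model = deepcopy(model)
    for i in range(lo, mid + 1):
        right_model.update(chunks[i])
    tree_cv(chunks, right_model, mid + 1, hi, losses)


def tree_cv_losses(chunks):
    losses = [None] * len(chunks)
    tree_cv(chunks, RunningMean(), 0, len(chunks) - 1, losses)
    return losses


def naive_cv_losses(chunks):
    """Standard recipe: retrain from scratch for every fold."""
    losses = []
    for j, held_out in enumerate(chunks):
        model = RunningMean()
        for i, chunk in enumerate(chunks):
            if i != j:
                model.update(chunk)
        losses.append(model.loss(held_out))
    return losses
```

For this toy learner the result does not depend on the order in which chunks are seen, so `tree_cv_losses` and `naive_cv_losses` agree. For order-sensitive incremental learners the two would generally differ, and the sketch ignores the cost of copying model state.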
Approximate Cross-validated Mean Estimates for Bayesian Hierarchical Regression Models
Zhang, Amy X., Bao, Le, Daniels, Michael J.
We introduce a novel procedure for obtaining cross-validated predictive estimates for Bayesian hierarchical regression models (BHRMs). Bayesian hierarchical models are popular for their ability to model complex dependence structures and provide probabilistic uncertainty estimates, but can be computationally expensive to run. Cross-validation (CV) is therefore not a common practice to evaluate the predictive performance of BHRMs. Our method circumvents the need to re-run computationally costly estimation methods for each cross-validation fold and makes CV more feasible for large BHRMs. By conditioning on the variance-covariance parameters, we shift the CV problem from probability-based sampling to a simple and familiar optimization problem. In many cases, this produces estimates which are equivalent to full CV. We provide theoretical results and demonstrate its efficacy on publicly available data and in simulations.
Proper Balancing for Cross Validation
Here we plot the precision results of balancing, with under-sampling, only the train set of each CV fold before fitting the model on it and making predictions on the CV fold's test set:
Here we plot the precision results of balancing, with over-sampling, only the train set of each CV fold before fitting the model on it and making predictions on the CV fold's test set:
It is clear that balancing so far did not help in getting good test results. However, that is out of scope for this article, and the goal of this article is achieved: to make the model produce, on each CV fold's test set, evaluation metric scores similar to those it would produce on an unseen test set, for the case that the train data are balanced.
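The procedure described here, balancing only the train portion of each CV fold and leaving the test portion untouched, can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the article's code: the helper names (`kfold_indices`, `undersample`, `balanced_train_folds`) are hypothetical, the folds are plain shuffled folds rather than stratified ones, and random under-sampling stands in for whatever sampler the article used.

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def undersample(indices, labels, rng):
    """Randomly drop samples so every class present is as small as the rarest one."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    smallest = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, smallest))
    return balanced


def balanced_train_folds(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; only train_idx is balanced."""
    rng = random.Random(seed)
    folds = kfold_indices(len(labels), k, seed)
    for j, test_idx in enumerate(folds):
        train_idx = [i for f, fold in enumerate(folds) if f != j for i in fold]
        # The test fold keeps its original class distribution on purpose.
        yield undersample(train_idx, labels, rng), test_idx
```

A model would then be fitted on `train_idx` and scored on `test_idx` in each iteration. Because the test fold is never resampled, its class ratio stays close to that of the original data, which is the property the article is after.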