Cross Validation
Bayes beats Cross Validation: Efficient and Accurate Ridge Regression via Expectation Maximization
Tew, Shu Yu, Boley, Mario, Schmidt, Daniel F.
We present a novel method for tuning the regularization hyper-parameter, $\lambda$, of a ridge regression that is faster to compute than leave-one-out cross-validation (LOOCV) while yielding estimates of the regression parameters of equal, or particularly in the setting of sparse covariates, superior quality to those obtained by minimising the LOOCV risk. The LOOCV risk can suffer from multiple and bad local minima for finite $n$ and thus requires the specification of a set of candidate $\lambda$, which can fail to provide good solutions. In contrast, we show that the proposed method is guaranteed to find a unique optimal solution for large enough $n$, under relatively mild conditions, without requiring the specification of any difficult to determine hyper-parameters. This is based on a Bayesian formulation of ridge regression that we prove to have a unimodal posterior for large enough $n$, allowing for both the optimal $\lambda$ and the regression coefficients to be jointly learned within an iterative expectation maximization (EM) procedure. Importantly, we show that by utilizing an appropriate preprocessing step, a single iteration of the main EM loop can be implemented in $O(\min(n, p))$ operations, for input data with $n$ rows and $p$ columns. In contrast, evaluating a single value of $\lambda$ using fast LOOCV costs $O(n \min(n, p))$ operations when using the same preprocessing. This advantage amounts to an asymptotic improvement of a factor of $l$ for $l$ candidate values for $\lambda$ (in the regime $q, p \in O(\sqrt{n})$ where $q$ is the number of regression targets).
Approximate Leave-one-out Cross Validation for Regression with $\ell_1$ Regularizers (extended version)
Auddy, Arnab, Zou, Haolin, Rad, Kamiar Rahnama, Maleki, Arian
The out-of-sample error (OO) is the main quantity of interest in risk estimation and model selection. Leave-one-out cross validation (LO) offers a (nearly) distribution-free yet computationally demanding approach to estimate OO. Recent theoretical work showed that approximate leave-one-out cross validation (ALO) is a computationally efficient and statistically reliable estimate of LO (and OO) for generalized linear models with differentiable regularizers. For problems involving non-differentiable regularizers, despite significant empirical evidence, the theoretical understanding of ALO's error remains unknown. In this paper, we present a novel theory for a wide class of problems in the generalized linear model family with non-differentiable regularizers. We bound the error |ALO - LO| in terms of intuitive metrics such as the size of leave-i-out perturbations in active sets, sample size n, number of features p and regularization parameters. As a consequence, for the $\ell_1$-regularized problems, we show that |ALO - LO| goes to zero as p goes to infinity while n/p and SNR are fixed and bounded.
Concentration inequalities for leave-one-out cross validation
Avelin, Benny, Viitasaari, Lauri
In this article we prove that estimator stability is enough to show that leave-one-out cross validation is a sound procedure, by providing concentration bounds in a general framework. In particular, we provide concentration bounds beyond Lipschitz continuity assumptions on the loss or on the estimator. We obtain our results by relying on random variables with distribution satisfying the logarithmic Sobolev inequality, providing us a relatively rich class of distributions. We illustrate our method by considering several interesting examples, including linear regression, kernel density estimation, and stabilized/truncated estimators such as stabilized kernel regression. Mathematics Subject Classifications (2020): 62R07, 62G05, 60F15 Keywords: Leave-one-out cross validation, concentration inequalities, logarithmic Sobolev inequality, sub-Gaussian random variables 1. Introduction It is customary in many statistical and machine learning methods to use train-validation, where the data is split into a training set and a validation set. Then the training set is used for estimating (training) the model, while the validation set is used to measure the performance of the fitted model.
A Novel Approach for Machine Learning-based Load Balancing in High-speed Train System using Nested Cross Validation
Fifth-generation (5G) mobile communication networks have recently emerged in various fields, including highspeed trains. However, the dense deployment of 5G millimeter wave (mmWave) base stations (BSs) and the high speed of moving trains lead to frequent handovers (HOs), which can adversely affect the Quality-of-Service (QoS) of mobile users. As a result, HO optimization and resource allocation are essential considerations for managing mobility in high-speed train systems. In this paper, we model system performance of a high-speed train system with a novel machine learning (ML) approach that is nested cross validation scheme that prevents information leakage from model evaluation into the model parameter tuning, thereby avoiding overfitting and resulting in better generalization error. To this end, we employ ML methods for the high-speed train system scenario. Handover Margin (HOM) and Time-to-Trigger (TTT) values are used as features, and several KPIs are used as outputs, and several ML methods including Gradient Boosting Regression (GBR), Adaptive Boosting (AdaBoost), CatBoost Regression (CBR), Artificial Neural Network (ANN), Kernel Ridge Regression (KRR), Support Vector Regression (SVR), and k-Nearest Neighbor Regression (KNNR) are employed for the problem. Finally, performance comparisons of the cross validation schemes with the methods are made in terms of mean absolute error (MAE) and mean square error (MSE) metrics are made. As per obtained results, boosting methods, ABR, CBR, GBR, with nested cross validation scheme superiorly outperforms conventional cross validation scheme results with the same methods. On the other hand, SVR, KNRR, KRR, ANN with the nested scheme produce promising results for prediction of some KPIs with respect to their conventional scheme employment.
A Robust Machine Learning Approach for Path Loss Prediction in 5G Networks with Nested Cross Validation
Yazฤฑcฤฑ, Ibrahim, Gures, Emre
The design and deployment of fifth-generation (5G) wireless networks pose significant challenges due to the increasing number of wireless devices. Path loss has a landmark importance in network performance optimization, and accurate prediction of the path loss, which characterizes the attenuation of signal power during transmission, is critical for effective network planning, coverage estimation, and optimization. In this sense, we utilize machine learning (ML) methods, which overcome conventional path loss prediction models drawbacks, for path loss prediction in a 5G network system to facilitate more accurate network planning, resource optimization, and performance improvement in wireless communication systems. To this end, we utilize a novel approach, nested cross validation scheme, with ML to prevent overfitting, thereby getting better generalization error and stable results for ML deployment. First, we acquire a publicly available dataset obtained through a comprehensive measurement campaign conducted in an urban macro-cell scenario located in Beijing, China. The dataset includes crucial information such as longitude, latitude, elevation, altitude, clutter height, and distance, which are utilized as essential features to predict the path loss in the 5G network system. We deploy Support Vector Regression (SVR), CatBoost Regression (CBR), eXtreme Gradient Boosting Regression (XGBR), Artificial Neural Network (ANN), and Random Forest (RF) methods to predict the path loss, and compare the prediction results in terms of Mean Absolute Error (MAE) and Mean Square Error (MSE). As per obtained results, XGBR outperforms the rest of the methods. It outperforms CBR with a slight performance differences by 0.4 % and 1 % in terms of MAE and MSE metrics, respectively. On the other hand, it outperforms the rest of the methods with clear performance differences.
Corrected generalized cross-validation for finite ensembles of penalized estimators
Bellec, Pierre, Du, Jin-Hong, Koriyama, Takuya, Patil, Pratik, Tan, Kai
Generalized cross-validation (GCV) is a widely-used method for estimating the squared out-of-sample prediction risk that employs a scalar degrees of freedom adjustment (in a multiplicative sense) to the squared training error. In this paper, we examine the consistency of GCV for estimating the prediction risk of arbitrary ensembles of penalized least squares estimators. We show that GCV is inconsistent for any finite ensemble of size greater than one. Towards repairing this shortcoming, we identify a correction that involves an additional scalar correction (in an additive sense) based on degrees of freedom adjusted training errors from each ensemble component. The proposed estimator (termed CGCV) maintains the computational advantages of GCV and requires neither sample splitting, model refitting, or out-of-bag risk estimation. The estimator stems from a finer inspection of ensemble risk decomposition and two intermediate risk estimators for the components in this decomposition. We provide a non-asymptotic analysis of the CGCV and the two intermediate risk estimators for ensembles of convex penalized estimators under Gaussian features and a linear response model. In the special case of ridge regression, we extend the analysis to general feature and response distributions using random matrix theory, which establishes model-free uniform consistency of CGCV.
Cross-Validation for Training and Testing Co-occurrence Network Inference Algorithms
Agyapong, Daniel, Propster, Jeffrey Ryan, Marks, Jane, Hocking, Toby Dylan
Microorganisms are found in almost every environment, including the soil, water, air, and inside other organisms, like animals and plants. While some microorganisms cause diseases, most of them help in biological processes such as decomposition, fermentation and nutrient cycling. A lot of research has gone into studying microbial communities in various environments and how their interactions and relationships can provide insights into various diseases. Co-occurrence network inference algorithms help us understand the complex associations of micro-organisms, especially bacteria. Existing network inference algorithms employ techniques such as correlation, regularized linear regression, and conditional dependence, which have different hyper-parameters that determine the sparsity of the network. Previous methods for evaluating the quality of the inferred network include using external data, and network consistency across sub-samples, both which have several drawbacks that limit their applicability in real microbiome composition data sets. We propose a novel cross-validation method to evaluate co-occurrence network inference algorithms, and new methods for applying existing algorithms to predict on test data. Our empirical study shows that the proposed method is useful for hyper-parameter selection (training) and comparing the quality of the inferred networks between different algorithms (testing).
Blocked Cross-Validation: A Precise and Efficient Method for Hyperparameter Tuning
Hyperparameter tuning plays a crucial role in optimizing the performance of predictive learners. Cross--validation (CV) is a widely adopted technique for estimating the error of different hyperparameter settings. Repeated cross-validation (RCV) has been commonly employed to reduce the variability of CV errors. In this paper, we introduce a novel approach called blocked cross-validation (BCV), where the repetitions are blocked with respect to both CV partition and the random behavior of the learner. Theoretical analysis and empirical experiments demonstrate that BCV provides more precise error estimates compared to RCV, even with a significantly reduced number of runs. We present extensive examples using real--world data sets to showcase the effectiveness and efficiency of BCV in hyperparameter tuning. Our results indicate that BCV outperforms RCV in hyperparameter tuning, achieving greater precision with fewer computations.
Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation
Du, Jin-Hong, Patil, Pratik, Kuchibhotla, Arun Kumar
We study subsampling-based ridge ensembles in the proportional asymptotics regime, where the feature size grows proportionally with the sample size such that their ratio converges to a constant. By analyzing the squared prediction risk of ridge ensembles as a function of the explicit penalty $\lambda$ and the limiting subsample aspect ratio $\phi_s$ (the ratio of the feature size to the subsample size), we characterize contours in the $(\lambda, \phi_s)$-plane at any achievable risk. As a consequence, we prove that the risk of the optimal full ridgeless ensemble (fitted on all possible subsamples) matches that of the optimal ridge predictor. In addition, we prove strong uniform consistency of generalized cross-validation (GCV) over the subsample sizes for estimating the prediction risk of ridge ensembles. This allows for GCV-based tuning of full ridgeless ensembles without sample splitting and yields a predictor whose risk matches optimal ridge risk.
Bootstrapping the Cross-Validation Estimate
Cai, Bryan, Pellegrini, Fabio, Pang, Menglan, de Moor, Carl, Shen, Changyu, Charu, Vivek, Tian, Lu
Cross-validation is a widely used technique for evaluating the performance of prediction models. It helps avoid the optimism bias in error estimates, which can be significant for models built using complex statistical learning algorithms. However, since the cross-validation estimate is a random value dependent on observed data, it is essential to accurately quantify the uncertainty associated with the estimate. This is especially important when comparing the performance of two models using cross-validation, as one must determine whether differences in error estimates are a result of chance fluctuations. Although various methods have been developed for making inferences on cross-validation estimates, they often have many limitations, such as stringent model assumptions This paper proposes a fast bootstrap method that quickly estimates the standard error of the cross-validation estimate and produces valid confidence intervals for a population parameter measuring average model performance. Our method overcomes the computational challenge inherent in bootstrapping the cross-validation estimate by estimating the variance component within a random effects model. It is just as flexible as the cross-validation procedure itself. To showcase the effectiveness of our approach, we employ comprehensive simulations and real data analysis across three diverse applications.