SPlit: An Optimal Method for Data Splitting

Joseph, V. Roshan; Vakayil, Akhil

arXiv.org Machine Learning 

When developing statistical and machine learning models, it is common to split the dataset into two parts: training and testing (Stone, 1974; Hastie et al., 2009). The training part is used for fitting the model, that is, for estimating the unknown parameters in the model. The model is then evaluated for its accuracy using the testing dataset. The reason for doing this is that if we were to use the entire dataset for fitting, the model could overfit the data, which can lead to poor predictions in future scenarios. Therefore, holding out a portion of the dataset and testing the model's performance on it before deploying the model in the field can protect against unexpected issues arising from overfitting. In this article we consider only datasets in which each row is independent; that is, we exclude cases such as time series data. The simplest and probably the most common strategy for splitting such a dataset is to randomly sample a fraction of the dataset.
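
The random splitting strategy mentioned above can be illustrated with a minimal sketch. This is not the SPlit method itself, only the simple random baseline; the function name `random_split`, the 20% test fraction, the seed, and the toy array are assumptions made for illustration.

```python
import numpy as np


def random_split(data, test_fraction=0.2, seed=0):
    """Randomly split the rows of `data` into training and testing parts.

    Assumes each row is an independent observation (no time series).
    `test_fraction` and `seed` are illustrative choices, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    n_test = int(round(test_fraction * n))

    # Shuffle the row indices, then cut the shuffled list into two parts.
    perm = rng.permutation(n)
    test_idx = perm[:n_test]
    train_idx = perm[n_test:]
    return data[train_idx], data[test_idx]


# Toy dataset: 10 rows, 2 columns (hypothetical values).
data = np.arange(20).reshape(10, 2)
train, test = random_split(data, test_fraction=0.2, seed=0)
print(train.shape, test.shape)
```

Every row ends up in exactly one of the two parts, and the rows that land in the testing part depend only on the random seed, not on the values in the data.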
