SPlit: An Optimal Method for Data Splitting
Joseph, V. Roshan, Vakayil, Akhil
When developing statistical and machine learning models, it is common to split the dataset into two parts: training and testing (Stone, 1974; Hastie et al., 2009). The training part is used to fit the model, that is, to estimate its unknown parameters. The fitted model is then evaluated for accuracy on the testing part. The reason for splitting is that a model fitted and evaluated on the entire dataset can overfit the data without the overfitting being detected, which can lead to poor predictions in future scenarios. Holding out a portion of the dataset and testing the model on it before deployment in the field therefore protects against unexpected issues caused by overfitting. In this article we consider only datasets whose rows are independent, which excludes cases such as time series data. The simplest and probably the most common strategy for splitting such a dataset is to randomly sample a fraction of its rows.
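The random-sampling strategy described above can be sketched in a few lines of Python. This is a minimal illustration of the baseline random split, not the SPlit method itself; the function name `random_split`, its parameters `test_fraction` and `seed`, and the toy data are all assumptions introduced here for illustration.

```python
import random


def random_split(rows, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of independent rows for testing.

    Hypothetical helper for illustration only: it shows the plain
    random-sampling baseline, not the SPlit method.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)           # local generator, reproducible split
    indices = list(range(len(rows)))
    rng.shuffle(indices)                # random permutation of row indices
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])    # first n_test shuffled indices are held out
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Toy dataset of (x, y) rows; each row is treated as independent.
rows = [(x, 2 * x + 1) for x in range(10)]
train, test = random_split(rows, test_fraction=0.2, seed=42)
```

Every row ends up in exactly one of the two parts, so the model is never evaluated on a row it was fitted to. Fixing the seed makes the split reproducible, although which rows land in the testing part still depends on the random draw.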
Dec-20-2020
- Technology: