SPlit: An Optimal Method for Data Splitting

Dec-20-2020–arXiv.org Machine Learning

When developing statistical and machine learning models, it is common to split the dataset into two parts: training and testing (Stone, 1974; Hastie et al., 2009). The training part is used to fit the model, that is, to estimate its unknown parameters. The model's accuracy is then evaluated on the testing part. We do this because, if the entire dataset were used for fitting, the model could overfit the data and give poor predictions in future scenarios. Holding out a portion of the dataset and testing the model on it before deployment therefore protects against unexpected issues caused by overfitting. In this article we consider only datasets in which each row is independent; cases such as time series data are excluded. The simplest, and probably the most common, strategy for splitting such a dataset is to randomly sample a fraction of its rows.
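The random-sampling strategy described above can be sketched in a few lines of Python with NumPy. This is only an illustration of the baseline random split, not of the SPlit method itself. The function name `random_split`, the 20% test fraction, the synthetic data, and the least-squares model are all assumptions made for the example.

```python
import numpy as np


def random_split(X, y, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the rows for testing.

    Hypothetical helper for illustration; it assumes the rows are independent.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)                # random ordering of row indices
    n_test = int(round(test_fraction * n))   # number of held-out rows
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Synthetic data (an assumption for this sketch): a linear response with noise.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + data_rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = random_split(X, y, test_fraction=0.2)

# Fit on the training part only, then evaluate on the held-out testing part.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
test_rmse = float(np.sqrt(np.mean((X_test @ coef - y_test) ** 2)))
print(f"train rows: {len(X_train)}, test rows: {len(X_test)}, test RMSE: {test_rmse:.3f}")
```

Fixing the seed makes the split reproducible. A different seed gives a different split, and therefore a somewhat different test error estimate.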

categorical variable, dataset, support point, (14 more...)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Optimization (1.00)
  - Machine Learning > Statistical Learning (0.94)
