Splitting Data For Machine Learning, Using T-SQL – Hydrate Consulting LLC


It is common practice in data science to split a dataset into three separate subsets before performing machine learning (ML) on it. The majority of the data is used to Train our ML models, and a portion of the remaining data is used to Validate those models. The best of these models is then applied to the remaining Testing data, which it has not seen before, to evaluate its performance. Graphical ML tools such as Azure Machine Learning often provide easily configurable drag-and-drop tools to split our data in this manner, but what happens if we are working on a custom solution, perhaps using something like SQL Server's In-Database Machine Learning? In this blog post we'll look at a couple of different T-SQL solutions to quickly split our data randomly into these datasets.
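To make the idea concrete, here is a minimal sketch of one way a random three-way split could look in T-SQL. It is not necessarily one of the solutions covered in this post. The table name `dbo.SourceData` and the 70/15/15 proportions are assumptions chosen purely for illustration.

```sql
-- Hypothetical source table: dbo.SourceData (name assumed for this sketch).
-- Assumed proportions: 70% Train, 15% Validate, 15% Test.
WITH Shuffled AS (
    SELECT
        s.*,
        -- Ordering by NEWID() gives each row a random position 1..N.
        ROW_NUMBER() OVER (ORDER BY NEWID()) AS RandomRank,
        -- Total row count, repeated on every row, used to compute cut-offs.
        COUNT(*) OVER () AS TotalRows
    FROM dbo.SourceData AS s
)
SELECT
    *,
    CASE
        WHEN RandomRank <= 0.70 * TotalRows THEN 'Train'
        WHEN RandomRank <= 0.85 * TotalRows THEN 'Validate'
        ELSE 'Test'
    END AS DatasetName
FROM Shuffled;
```

Ranking the rows once and then slicing by rank keeps the three groups close to the requested sizes. Because `NEWID()` produces new values on every execution, the assignment changes from run to run, so the result would need to be saved to a table if the split has to be reproducible.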