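We can feed input data (emails) to a machine learning algorithm, and it will automatically discover rules that are powerful enough to distinguish spam emails from legitimate ones.

The most common preprocessing steps are removing missing values, converting categorical data into a form suitable for machine learning algorithms, and feature scaling.

Some categorical features are ordinal, meaning their values have a natural order. For example, the sizes small, medium, and large can be ordered as large > medium > small. Other categorical features, such as color, have no natural order and can be one-hot encoded instead. For example, a sample with the color "Red" is then encoded as (Red 1, Green 0, Blue 0).

For feature scaling, assume we have data with two features, one on a scale from 1 to 10 and the other on a scale from 1 to 1000. Scaling brings both features onto a comparable range so that the larger-valued feature does not dominate simply because of its units.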
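
The preprocessing steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the toy rows, the field names (`size`, `color`, `f1`, `f2`), the helper functions, and the choice of min-max scaling are all hypothetical and not taken from the original text. Min-max scaling is only one of several ways to scale features.

```python
# Toy rows; field names and values are hypothetical, chosen only for illustration.
rows = [
    {"size": "small", "color": "Red", "f1": 1.0, "f2": 1.0},
    {"size": "large", "color": "Blue", "f1": 10.0, "f2": 1000.0},
    {"size": "medium", "color": "Green", "f1": 5.5, "f2": 500.5},
    {"size": "medium", "color": None, "f1": 3.0, "f2": 40.0},  # missing color
]

# Assumed integer codes for the ordinal feature: small < medium < large.
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}

# Assumed fixed category list for the nominal (unordered) feature.
COLORS = ["Red", "Green", "Blue"]


def drop_missing(rows):
    """Keep only the rows in which no field is None."""
    return [r for r in rows if all(v is not None for v in r.values())]


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Step 1: remove rows with missing values.
clean = drop_missing(rows)

# Step 2: encode categorical features.
sizes = [SIZE_ORDER[r["size"]] for r in clean]          # ordinal encoding
colors = [one_hot(r["color"], COLORS) for r in clean]   # one-hot encoding

# Step 3: scale both numeric features to the range [0, 1].
f1_scaled = min_max_scale([r["f1"] for r in clean])
f2_scaled = min_max_scale([r["f2"] for r in clean])

for s, c, a, b in zip(sizes, colors, f1_scaled, f2_scaled):
    print(s, c, a, b)
```

After scaling, the feature that originally ranged from 1 to 1000 and the one that ranged from 1 to 10 both lie between 0 and 1, so neither outweighs the other purely because of its original scale. In practice a library such as scikit-learn or pandas would typically handle these steps; the plain-Python version here only makes the mechanics visible.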