7 Steps to Mastering Data Preparation for Machine Learning with Python -- 2019 Edition


Whatever term you choose, they refer to a roughly related set of pre-modeling data activities in the machine learning, data mining, and data science communities. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting. This may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data munging as a process typically follows a set of general steps which begin with extracting the data in a raw form from the data source, "munging" the raw data using algorithms (e.g. I would say that it is "identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data" in the context of "mapping data from one'raw' form into another..." all the way up to "training a statistical model" which I like to think of data preparation as encompassing, or "everything from data sourcing right up to, but not including, model building."