Avoid These Data Pitfalls When Moving Machine Learning Applications Into Production
How often have you heard, "The Machine Learning application worked well in the lab, but it failed in the field"? It is not the fault of the Machine Learning model!

This is not yet another blog article (YABA) on DataOps, DevOps, MLOps, or CloudOps. I do not mean to imply that xOps is not essential. MLOps, for example, is both strategic and tactical: it promises to transform the ad hoc delivery of Machine Learning applications into software engineering best practice. We know the symptoms: most Machine Learning models trained in the lab perform poorly on real-world data [1, 2, 3, 4]. Machine Learning generated profits in 2020 and will continue to do so, yet many problems hold back the rollout of Machine Learning applications to production. Here I focus on what is arguably the most significant cause: the quality and quantity of the input data fed to Machine Learning models [1, 4].

We realized that the quantity of high-quality data was the bottleneck in predictive accuracy when models began showing near or above human-level performance on structured data, imagery, game playing, and natural language tasks. How often do we examine the conceptualization of the Machine Learning application lifecycle and realize that the model is not at the beginning of it (Figure 2)? We can research and improve the tools of the lifecycle, but that only lowers the cost of deployment. Arguably, the choice of Machine Learning model is not the critical part of deploying a Machine Learning application: we have a "good enough" process, or pipeline, for choosing and changing the model, given a training dataset. When pushing for State-of-the-Art (SOTA) results, however, the input data has the most significant impact on the output predictions (Figure 2). The cause is well known: garbage input data produces garbage output predictions.
New data fed to a trained Machine Learning model determines the accuracy of its output. We divide Machine Learning input data into four somewhat arbitrary categories, defined by the output accuracy of the Machine Learning application. GPT-3 is an example [6]: it was trained with an enormous amount of data [6] and is now frozen in time as a transformer that you access through an API.

Concept Drift is a change in what you are predicting, for example, a change in the definition of "what is a spammer." We do not cover Concept Drift here; I think of it not as a problem but as a change in the solution's scope. An example of Case 2, Data Drift, is that Case 1, "It works!", is a temporal phenomenon: the model's accuracy decays as the live data distribution shifts away from the training data.
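One common way to detect Data Drift is to compare the distribution of a feature in production against the same feature in the training set. The sketch below is illustrative only, not a recipe from this article: it computes a two-sample Kolmogorov-Smirnov statistic (the maximum gap between the two empirical CDFs) in plain Python, and the `THRESHOLD` value is an assumed, made-up alert level you would tune for your own data.

```python
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples (0 = identical
    distributions, 1 = completely disjoint)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n_a, n_b = len(a), len(b)
    max_gap = 0.0
    i = j = 0
    for x in sorted(set(a + b)):
        # Advance each pointer past all values <= x, so i/n_a and
        # j/n_b are the empirical CDFs evaluated at x.
        while i < n_a and a[i] <= x:
            i += 1
        while j < n_b and b[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / n_a - j / n_b))
    return max_gap


# Simulated feature values: training data vs. two production windows,
# one drawn from the same distribution and one with a shifted mean.
random.seed(0)
train_feature = [random.gauss(0.0, 1.0) for _ in range(1000)]
live_same = [random.gauss(0.0, 1.0) for _ in range(1000)]
live_drifted = [random.gauss(0.5, 1.0) for _ in range(1000)]

THRESHOLD = 0.1  # assumed alert level; tune per feature in practice
for name, window in [("same", live_same), ("drifted", live_drifted)]:
    stat = ks_statistic(train_feature, window)
    flag = "DRIFT" if stat > THRESHOLD else "ok"
    print(f"{name}: KS = {stat:.3f} -> {flag}")
```

In a real pipeline you would run such a check per feature on a schedule and trigger retraining or investigation when the statistic crosses the tuned threshold; libraries such as SciPy provide a tested `ks_2samp` implementation with p-values.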
Jun-22-2021, 14:10:50 GMT