The normal life cycle of a machine learning model includes several stages (see Figure 1). There are countless online courses and articles about preparing data and building models, but far less material about model deployment. Yet it is precisely at this stage that all the hard work of data preparation and model building starts to pay off: this is where models are used to score (that is, generate predictions for) new cases and deliver their benefits. My intent here is to fill this gap, so that you will be fully prepared to deploy your model using time-tested resources.
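To make the "score new cases" stage concrete, here is a minimal sketch of the deployment hand-off: training produces a fitted artifact, deployment serializes it, and a serving process reloads it to score fresh data. The `ThresholdModel` class is a hypothetical toy stand-in for a real trained model, not part of any library discussed here.

```python
import os
import pickle
import tempfile

class ThresholdModel:
    """Toy stand-in for a trained model: predicts 1 when the
    mean of a row's features exceeds a learned threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        return [1 if sum(r) / len(r) > self.threshold else 0 for r in rows]

# "Training" produces a fitted artifact...
model = ThresholdModel(threshold=0.5)

# ...deployment serializes it so a separate serving process can use it...
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and scoring reloads the artifact and predicts on new cases.
with open(path, "rb") as f:
    deployed = pickle.load(f)

print(deployed.predict([[0.9, 0.8], [0.1, 0.2]]))  # → [1, 0]
```

In practice the serialized artifact would be produced by whatever library trained the model (scikit-learn, XGBoost, etc.), but the load-then-score shape of the serving side stays the same.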
Continuing our series on Gojek's Machine Learning Platform, in this article we tackle the problem of model deployment and serving. These steps are typically among the most time-consuming in a model's life cycle and require expertise outside of normal data science work. The first thing data scientists encounter when trying to deploy and serve their models is the myriad of tools and workflows involved. Most data scientists use Python or R, one of dozens of ML libraries (TensorFlow, XGBoost, GBM, H2O), and are familiar with Jupyter notebooks or their favorite IDE. Model deployment and serving, however, introduces a whole new universe of technologies: Docker, Kubernetes, Jenkins, nginx, REST, protobuf, ELK, and Grafana are just some of the many unfamiliar names a data scientist might encounter when trying to get their work into production.
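To give a flavor of what "serving over REST" means in the simplest case, here is a sketch of a scoring endpoint built on Python's standard-library `http.server`. The `predict` function with fixed weights is a hypothetical placeholder for a real model, and the port and JSON payload shape are assumptions for illustration; a production setup would use a proper framework behind nginx, packaged in Docker, as the article describes.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Hypothetical toy model: a fixed linear score, standing in
    # for a real trained model loaded at startup.
    weights = [0.4, -0.2, 0.1]
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0

class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, e.g. {"features": [1.0, 1.0, 1.0]}.
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging for this sketch

def serve(port=8080):
    # Run the server on a background thread and return it to the caller.
    server = HTTPServer(("127.0.0.1", port), ScoringHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A client would then `POST` a JSON body of features and read back a JSON prediction; swapping the toy `predict` for a deserialized model turns this into a (bare-bones) model server.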
Here's the thing about machine learning: feed it the right datasets and it'll help you root out malware with great accuracy and efficiency. But models are what they eat. Feed them a diet of questionable, biased data and they'll produce garbage. That's the message Sophos data scientist Hillary Sanders delivered at Black Hat USA 2017 on Wednesday in a talk called "Garbage in, Garbage Out: How Purportedly Great Machine Learning Models Can Be Screwed Up By Bad Data". A lot of security experts tout machine learning as the next step in anti-malware technology.
The platform supports both R and Python code, appealing to data scientists and machine learning engineers alike. It also integrates with legacy software platforms like SAP, Oracle, and Microsoft, which bolsters backend connectivity. For monitoring, the platform identifies bottlenecks and other performance-related issues, complete with helpful visualizations. Lead identification and qualification can be two of the most time-consuming parts of the sales workflow, so this tool could be helpful for both small and large teams looking to streamline their sales processes.