Productionizing Distributed XGBoost to Train Deep Tree Models with Large Data Sets at Uber
XGBoost on Apache Spark treats zero values in SparseVectors as missing values (the missing value defaults to Float.NaN), whereas zeros in DenseVectors are treated simply as zeros. Apache Spark ML implicitly optimizes vector storage, so a vector is stored as a SparseVector or a DenseVector depending on which is more space-efficient. If an ML practitioner feeds a DenseVector at inference time to a model trained on SparseVectors, or vice versa, XGBoost gives no warning, and the input will likely be routed down unexpected branches because of how zeros are stored, resulting in inconsistent predictions. Hence, it is critical that the input storage structure remains consistent between training and serving.
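To make the mismatch concrete, here is a minimal pure-Python sketch. It is a toy model of the behavior described above, not XGBoost or Spark code: the helpers `features_from_dense`, `features_from_sparse`, and `stump_predict`, along with the threshold and leaf values, are hypothetical and chosen only for illustration. It assumes, as stated above, that entries absent from sparse storage are read as missing (NaN) and that a tree node sends missing values in a learned default direction.

```python
import math


def features_from_dense(values):
    """Dense storage: every entry, including 0.0, is an explicit value."""
    return list(values)


def features_from_sparse(size, indices, values):
    """Sparse storage: entries that are not stored are read as missing (NaN)."""
    row = [math.nan] * size
    for i, v in zip(indices, values):
        row[i] = v
    return row


def stump_predict(row, feature=1, threshold=-0.5,
                  left=10.0, right=20.0, missing_goes_left=True):
    """Toy one-split tree with a default direction for missing values."""
    x = row[feature]
    if math.isnan(x):
        # Missing values follow the default direction, not the threshold test.
        return left if missing_goes_left else right
    return left if x < threshold else right


# The same logical vector [3.0, 0.0, 5.0] in both storage formats.
dense_row = features_from_dense([3.0, 0.0, 5.0])
sparse_row = features_from_sparse(3, [0, 2], [3.0, 5.0])

dense_pred = stump_predict(dense_row)    # 0.0 is compared against the threshold
sparse_pred = stump_predict(sparse_row)  # the zero is absent, so it is missing

print(dense_pred, sparse_pred)
```

In this sketch the dense row takes the right branch because 0.0 is not below the threshold, while the sparse row takes the default (left) branch, so one logical input yields two different predictions. Converting every vector to a single representation before both training and serving removes the ambiguity.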
Productionizing your Machine Learning model
Even though pushing your machine learning model to production is one of the most important steps in building a machine learning application, there aren't many tutorials showing how to do it. In this article, I will therefore go over how to productionize a Ludwig model by building a REST API as well as a regular website using the Flask web micro-framework. The same process can be applied to almost any other machine learning framework with only a few small changes. Both the website and the REST API we will build in this article are very simple and won't scale well. In the next article, we will therefore look at how to scale our Flask website to handle multiple concurrent requests, which is also shown in this excellent article.