To learn more about cutting-edge data science tools like Apache Kafka, check out the Strata Data Conference in San Jose, March 5-8, 2018--registration is now open. Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their product, processes or customer experience. The cartoon version of machine learning sounds quite easy: you feed in training data made up of examples of good and bad outcomes, and the computer automatically learns from these and spits out a model that can make similar predictions on new data not seen before. What could be easier, right? Those with real experience building and deploying production systems built around machine learning know that, in fact, these systems are shockingly hard to build. This difficulty is not, for the most part, the algorithmic or mathematical complexities of machine learning algorithms. Creating such algorithms is difficult, to be sure, but the algorithm creation process is mostly done by academic researchers.
Intelligent real time applications are a game changer in any industry. Machine learning and its sub-topic, deep learning, are gaining momentum because machine learning allows computers to find hidden insights without being explicitly programmed where to look. This capability is needed for analyzing unstructured data, image recognition, speech recognition, and intelligent decision making. It is an important difference from traditional programming with Java, .NET, or Python. While the concepts behind machine learning are not new, the availability of big data sets and processing power allow every enterprise to build powerful analytic models.
"Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective," lectured Justin Norman during the talk on the deployment of Machine Learning models at this year's DataWorks Summit in Barcelona. Indeed, an industrial Machine Learning system is a part of a vast data infrastructure, which renders an end-to-end ML workflow particularly complex. The challenges linked to the development, deployment, and maintenance of the real-world ML systems should not be overlooked as we pursue the finest ML algorithms. Machine Learning is not necessarily meant to replace human decision making, it is mainly about helping humans make complex judgment base decisions. The talk I attended, Machine Learning Model Deployment: Strategy to Implementation, was given by Cloudera's experts, Justin Norman and Sagar Kewalramani.
Uber expanded Michelangelo "to serve any kind of Python model from any source to support other Machine Learning and Deep Learning frameworks like PyTorch and TensorFlow [instead of just using Spark for everything]." So why did Uber (and many other tech companies) build its own platform and framework-independent machine learning infrastructure? The posts How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka and Using Apache Kafka to Drive Cutting-Edge Machine Learning describe the benefits of leveraging the Apache Kafka ecosystem as a central, scalable, and mission-critical nervous system. It allows real-time data ingestion, processing, model deployment, and monitoring in a reliable and scalable way. This post focuses on how the Kafka ecosystem can help solve the impedance mismatch between data scientists, data engineers, and production engineers. By leveraging it to build your own scalable machine learning infrastructure and also make your data scientists happy, you can solve the same problems for which Uber built its own ML platform, Michelangelo. Based on what I've seen in the field, an impedance mismatch between data scientists, data engineers, and production engineers is the main reason why companies struggle to bring analytic models into production to add business value.