In the past few years there has been a large increase in tools that try to solve the challenge of bringing machine learning models to production. One thing these tools seem to have in common is the incorporation of notebooks into production pipelines. This article aims to explain why this drive towards notebooks in production is an anti-pattern, offering some suggestions along the way. Let's start by defining notebooks, for those readers who haven't been exposed to them or who call them by a different name: notebooks are web interfaces that allow a user to create documents containing code, visualisations and text.
Uncounted pixels have been spilled about how great Jupyter Notebooks are (shameless plug: I've spilled some of those pixels myself). Jupyter Notebooks allow data scientists to iterate quickly as we explore data sets, try different models, visualize trends, and perform many other tasks. We can execute code out of order, preserving context as we tweak our programs. We can even convert our notebooks into documents or slides to present to our stakeholders. Jupyter Notebooks help us work through a project from its earliest stages to the point where we have a great deal to say.
One of the most common questions people ask is which IDE, environment or tool to use while working on data science projects. As you would expect, there is no dearth of options available – from language-specific IDEs like RStudio and PyCharm to editors like Sublime Text or Atom – and the choice can be intimidating for a beginner. If there is one tool every data scientist should use, or at least be comfortable with, it is Jupyter Notebooks (previously known as iPython Notebooks). Jupyter Notebooks are powerful, versatile and shareable, and provide the ability to perform data visualization in the same environment. They allow data scientists to create and share documents, from code to full-blown reports.
Notebooks are the data scientist's best friend, but they can also be a nightmare to work with. For someone accustomed to working with modern integrated development environments (IDEs), working with notebooks feels like going back decades. Furthermore, modern notebook environments are mostly constrained to Python and lack first-class support for other programming languages. Polynote was born out of the necessity to accelerate data science experimentation at Netflix. Over the years, Netflix has built a world-class machine learning platform based mostly on JVM languages such as Scala.
As many data scientists and engineers can attest, the majority of the time is spent not on the models themselves but on the supporting infrastructure. Key issues include the ability to easily visualize, share, deploy and schedule jobs. More disconcerting still is the need for data engineers to re-implement the models developed by data scientists before they can run in production.