At HomeAway, we use Apache Kafka as the backbone of our streaming architecture. We also deploy machine learning models to make real-time predictions on our data streams. Confluent KSQL provides an easy-to-use, interactive SQL interface for stream processing on Kafka. Below, we show how to build a model in Python and use it in KSQL to make predictions on a stream of data in Kafka. We use Predictive Model Markup Language (PMML) so that we can train the model with the Python library scikit-learn but perform model inference in Java-based KSQL.
We are going to build a neural network from scratch in Python, without using a library. We will train the model on the iris data and aim for high accuracy. We will not go into the mathematical background of neural networks, as there are many excellent Medium articles covering it (Article 1, Article 2). The iris data is one of the most commonly used data sets for testing machine learning algorithms. It contains four features (sepal length, sepal width, petal length, and petal width) for three species of the iris flower (versicolor, virginica, and setosa).
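To make the idea concrete, here is a minimal sketch of such a network: one tanh hidden layer, a softmax output, and plain stochastic gradient descent, using only Python's standard `math` and `random` modules. The layer size, learning rate, epoch count, and all function names are assumptions chosen for illustration, and only nine rows of the iris data (three per species) are included to keep the example short.

```python
import math
import random

# Nine rows of the iris data: (sepal length, sepal width, petal length,
# petal width) and a class index (0 = setosa, 1 = versicolor, 2 = virginica).
SAMPLES = [
    ([5.1, 3.5, 1.4, 0.2], 0), ([4.9, 3.0, 1.4, 0.2], 0), ([4.7, 3.2, 1.3, 0.2], 0),
    ([7.0, 3.2, 4.7, 1.4], 1), ([6.4, 3.2, 4.5, 1.5], 1), ([6.9, 3.1, 4.9, 1.5], 1),
    ([6.3, 3.3, 6.0, 2.5], 2), ([5.8, 2.7, 5.1, 1.9], 2), ([7.1, 3.0, 5.9, 2.1], 2),
]
N_IN, N_HIDDEN, N_OUT = 4, 5, 3  # hidden size is an arbitrary choice


def standardize(samples):
    """Rescale every feature to zero mean and unit variance."""
    columns = list(zip(*(x for x, _ in samples)))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(columns, means)]
    return [([(v - m) / s for v, m, s in zip(x, means, stds)], y)
            for x, y in samples]


def init_network(seed=0):
    """Small random weights, zero biases."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)] for _ in range(N_OUT)]
    return w1, [0.0] * N_HIDDEN, w2, [0.0] * N_OUT


def forward(net, x):
    """Return the hidden activations and the softmax class probabilities."""
    w1, b1, w2, b2 = net
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w2, b2)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return hidden, [e / total for e in exps]


def mean_loss(net, samples):
    """Average cross-entropy loss over the samples."""
    return -sum(math.log(forward(net, x)[1][y]) for x, y in samples) / len(samples)


def train(net, samples, epochs=300, lr=0.05):
    """Update the weights in place, one sample at a time."""
    w1, b1, w2, b2 = net
    for _ in range(epochs):
        for x, y in samples:
            hidden, probs = forward(net, x)
            # Gradient of softmax + cross-entropy w.r.t. the logits.
            d_out = [p - (1.0 if k == y else 0.0) for k, p in enumerate(probs)]
            # Backpropagate through tanh before touching w2.
            d_hidden = [(1.0 - h * h) * sum(d_out[k] * w2[k][j] for k in range(N_OUT))
                        for j, h in enumerate(hidden)]
            for k in range(N_OUT):
                for j in range(N_HIDDEN):
                    w2[k][j] -= lr * d_out[k] * hidden[j]
                b2[k] -= lr * d_out[k]
            for j in range(N_HIDDEN):
                for i in range(N_IN):
                    w1[j][i] -= lr * d_hidden[j] * x[i]
                b1[j] -= lr * d_hidden[j]
    return net


def predict(net, x):
    """Index of the most probable class."""
    probs = forward(net, x)[1]
    return probs.index(max(probs))


data = standardize(SAMPLES)
net = init_network()
print("loss before training:", mean_loss(net, data))
train(net, data)
print("loss after training:", mean_loss(net, data))
```

With so few rows this only demonstrates the mechanics of the forward and backward passes; it says nothing about accuracy on the full data set.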
It is schema-based and wraps scikit-learn. To get a feel for the library, consider the classic Iris dataset, where we predict the class of iris plant from measurements of the sepal and petal. First, we create a schema describing our inputs and outputs. For our inputs, we have the length and width of both the sepal and the petal. All of these input values happen to be numbers.
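As a rough illustration of what a schema for this problem might hold, the sketch below describes the four numeric inputs and the single categorical output with plain Python dataclasses. It is not the library's own syntax: `Field`, `IRIS_SCHEMA`, `check_inputs`, and the field names are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical schema entry: a column name and its kind."""
    name: str
    kind: str  # "number" or "category"


# Hypothetical schema: four numeric inputs, one categorical output.
IRIS_SCHEMA = {
    "inputs": [
        Field("sepal_length", "number"),
        Field("sepal_width", "number"),
        Field("petal_length", "number"),
        Field("petal_width", "number"),
    ],
    "outputs": [Field("class", "category")],
}


def check_inputs(row, schema):
    """Return a list of problems found in one input row (empty if none)."""
    problems = []
    for field in schema["inputs"]:
        if field.name not in row:
            problems.append(f"missing input: {field.name}")
            continue
        value = row[field.name]
        # bool is a subclass of int, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if field.kind == "number" and not is_number:
            problems.append(f"{field.name} is not a number")
    return problems


row = {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}
print(check_inputs(row, IRIS_SCHEMA))
```

Declaring the inputs and outputs up front like this lets rows be validated before they reach the model, which is one reason a schema-based wrapper can be convenient.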
Unsupervised learning is a class of machine learning techniques for finding patterns in data. The data given to an unsupervised algorithm is not labelled, which means only the input variables (X) are given, with no corresponding output variables. In unsupervised learning, the algorithms are left to themselves to discover interesting structures in the data. In supervised learning, by contrast, the system tries to learn from previous examples that are given. So if the dataset is labelled, it is a supervised problem; if the dataset is unlabelled, it is an unsupervised problem.
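One way to see an algorithm "discover structure" without labels is k-means clustering, sketched below in plain Python. K-means is only one example of an unsupervised technique, chosen here for brevity; the six points are made up for illustration, and the function names and starting centroids are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Made-up, unlabelled 2-D points: only the inputs X, no outputs.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Start from the first two points as initial centroids (an arbitrary choice).
centroids, clusters = kmeans(POINTS, [POINTS[0], POINTS[1]])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

No labels are passed in at any point; the grouping comes entirely from distances between the inputs. A supervised method would instead need a target value for each point.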