A data stream is an unbounded sequence of data arriving continuously. Streaming divides continuously flowing input data into discrete units for further processing. Stream processing is low latency processing and analyzing of streaming data. Spark Streaming was added to Apache spark in 2013, an extension of the core Spark API that provides scalable, high-throughput and fault-tolerant stream processing of live data streams. Data ingestion can be done from many sources like Kafka, Apache Flume, Amazon Kinesis or TCP sockets and processing can be done using complex algorithms that are expressed with high-level functions like map, reduce, join and window.
Spark Streaming provides a scalable, fault tolerant, efficient way of processing streaming data using Spark's simple programming model. It converts streaming data into "micro" batches, which enable Spark's batch programming model to be applied in Streaming use cases. This unified programming model makes it easy to combine batch and interactive data processing with streaming. The core abstraction in Spark Streaming is Discretized Stream (DStream). DStream is a sequence of RDDs.
This post is the second part in a series where we will build a real-time example for analysis and monitoring of Uber car GPS trip data. If you have not already read the first part of this series, you should read that first. The first post discussed creating a machine learning model using Apache Spark's K-means algorithm to cluster Uber data based on location. This second post will discuss using the saved K-means model with streaming data to do real-time analysis of where and when Uber cars are clustered. The example data set is Uber trip data, which you can read more about in part 1 of this series.
This article is part of the forthcoming Data Science for Internet of Things Practitioner course in London. If you want to be a Data Scientist for the Internet of Things, this intensive course is ideal for you. We cover complex areas like Sensor fusion, Time Series, Deep Learning and others. We work with Apache Spark, R language and leading IoT platforms. This is the 1st part of a series of 3 part article which discusses SQL with Spark for Real Time Analytics for IOT.
Spark is a powerful tool which can be applied to solve many interesting problems. Some of them have been discussed in our previous posts. Today we will consider another important application, namely streaming. Streaming data is the data which continuously comes as small records from different sources. There are many use cases for streaming technology such as sensor monitoring in industrial or scientific devices, server logs checking, financial markets monitoring, etc.