Kafka Streams is a library designed to make it easy to process streams of data flowing into a Kafka cluster. Stream processing has become one of the biggest needs for companies over the last few years as quick data insight becomes more and more important, but current solutions can be complex and large, requiring additional tools to perform lookups and aggregations. Kafka Streams in Action teaches readers everything they need to know to implement stream processing on data flowing into their platform, allowing them to focus on getting more from their data without sacrificing time or effort. By the end of the book, readers will be ready to use Kafka Streams in their projects to reap the benefits of the insight their data holds quickly and easily. Bill Bejeck is a Kafka Streams contributor with over 13 years of software development experience.
In simple words, Kafka Streams is a library which you can include in your Java-based applications to build stream processing applications on top of Apache Kafka. Other distributed computing platforms such as Apache Spark and Apache Storm are widely used in the big data stream processing world, but Kafka Streams brings some unique propositions in this area. Kafka Streams provides a State Store feature, which applications can use to store their local processing results (the state). RocksDB backs the default state store, and state stores can be persistent or in-memory. In our sample application, the state we care about is the count of occurrences of the keywords we chose to follow. How is it implemented? Oracle Application Container Cloud provides access to a scalable in-memory cache, which is used as the custom state store in our use case. It's possible to scale our stream processing service both ways (details in the documentation), i.e. elastically.
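The keyword-count state described above can be modeled in plain Java. This is a minimal sketch using only the JDK, not the Kafka Streams API: CountStore, InMemoryCountStore, and KeywordCounter are hypothetical names introduced for illustration, and the in-memory map merely stands in for whatever backs the real store (RocksDB by default, or a remote cache in the custom case). It also assumes a keyword is counted at most once per record.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical store abstraction; a real store could be RocksDB or a remote cache. */
interface CountStore {
    long get(String key);
    void put(String key, long value);
}

/** In-memory stand-in for the state store (illustration only). */
class InMemoryCountStore implements CountStore {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public long get(String key) {
        // A keyword that has never been seen has a count of zero.
        return counts.getOrDefault(key, 0L);
    }

    @Override
    public void put(String key, long value) {
        counts.put(key, value);
    }
}

/** Keeps a running count per followed keyword; keywords are assumed to be lowercase. */
class KeywordCounter {
    private final CountStore store;
    private final List<String> keywords;

    KeywordCounter(CountStore store, List<String> keywords) {
        this.store = store;
        this.keywords = keywords;
    }

    /** Processes one record: each keyword found in the text has its stored count incremented by one. */
    void process(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (lower.contains(keyword)) {
                store.put(keyword, store.get(keyword) + 1);
            }
        }
    }

    long countOf(String keyword) {
        return store.get(keyword);
    }
}
```

Because the counter only talks to the CountStore interface, swapping the in-memory map for a different backing store would not change the processing logic, which is the point of a pluggable state store.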
Using streaming technologies with Kafka, Spark, and Cassandra to effectively gain insights on data. A tremendous stream of data is consumed and created by applications these days. This data includes application logs, event transaction logs (errors, warnings), batch job data, IoT sensor data, social media, data from other external systems, and much more. All of it can be piped through data pipelines or stages that give insights and provide tremendous benefits to the organization. As a recent article in the Economist put it, "The world's most valuable resource is no longer oil, but data".
With the release of Apache Kafka 1.0 this week, an eight-year journey is finally coming to a temporary end. Temporary because the project will continue to evolve, with near-term bug fixes and long-term feature updates. But for Neha Narkhede, Chief Technology Officer of Confluent, this release is the culmination of work towards a vision she and a team of engineers first laid out in 2009. Back then, a team at LinkedIn decided it had the solution to a major data stream processing problem. Narkhede said the originators of Kafka began their journey toward building the project by sitting down and trying to understand why stream processing companies founded in the 1990s and 2000s had failed.
You may be building an application using event sourcing and need a store for the log of changes. Theoretically you could use any system to store this log, but Kafka directly solves a lot of the problems of an immutable log and the "materialized views" computed from it. The New York Times does this for all their article data as the heart of their CMS. You may have an in-memory cache in each instance of your application that is fed by updates from Kafka. A very simple way of building this is to make the Kafka topic log-compacted, and have the app simply start fresh at offset zero whenever it restarts to populate its cache.
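The restart-and-replay idea can be illustrated without a broker. The sketch below uses only the JDK and models the log as a list: LogEntry and CompactedLogCache are hypothetical names, a null value is treated as a delete marker (tombstone), and compact is a simplified stand-in for log compaction that keeps only the latest entry per key. It assumes non-null keys. The property it shows is that replaying the compacted log from offset zero yields the same cache as replaying the full log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One keyed log record; a null value is treated as a tombstone (delete marker). */
class LogEntry {
    final String key;
    final String value;

    LogEntry(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

class CompactedLogCache {

    /** Simplified compaction: keeps only the last entry written for each key, in log order. */
    static List<LogEntry> compact(List<LogEntry> log) {
        // First pass: remember the offset of the latest entry for every key.
        Map<String, Integer> lastOffset = new HashMap<>();
        for (int offset = 0; offset < log.size(); offset++) {
            lastOffset.put(log.get(offset).key, offset);
        }
        // Second pass: keep an entry only if it is the latest one for its key.
        List<LogEntry> compacted = new ArrayList<>();
        for (int offset = 0; offset < log.size(); offset++) {
            LogEntry entry = log.get(offset);
            if (lastOffset.get(entry.key) == offset) {
                compacted.add(entry);
            }
        }
        return compacted;
    }

    /** Rebuilds the in-memory cache by replaying the log from offset zero. */
    static Map<String, String> rebuildCache(List<LogEntry> log) {
        Map<String, String> cache = new HashMap<>();
        for (LogEntry entry : log) {
            if (entry.value == null) {
                cache.remove(entry.key);   // tombstone: the key was deleted
            } else {
                cache.put(entry.key, entry.value);
            }
        }
        return cache;
    }
}
```

In a real application the list would be replaced by a consumer reading the topic from the beginning, but the cache-building loop would follow the same put-or-remove pattern.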