In this post I'm going to show what streaming ETL looks like in practice. My first job from university was building a data warehouse for a retailer in the UK. Back then, it was writing COBOL jobs to load tables in DB2. We waited for all the shops to close and do their end of day system processing, and send their data back to the central mainframe. From there it was checked and loaded, and then reports generated on it.
The demand for real-time data processing is rising, and streaming vendors are proliferating and competing. Apache Kafka is a key component in many data pipeline architectures, mostly due to its ability to ingest streaming data from a variety of sources in real time. Confluent, the commercial entity behind Kafka, has the ambition to leverage this position to become a platform of choice for real-time application development in the enterprise. On the road to implementing this vision, Kafka has expanded its reach to include more than data ingestion -- most notably, processing. In this process, the overlap with other platforms is growing and Confluent seems set on adding features that will enable Kafka to stand out.
The Yelp Data Pipeline gives developers a suite of tools to easily move data around the company. We have outlined three main components of the core Data Pipeline infrastructure so far. First, the MySQLStreamer replicates MySQL statements and publishes them into a stream of schema-backed Kafka topics. Second, the Schematizer provides a centralized source of truth about each of our Kafka topics. It persists the Avro schema used to encode the data in a particular topic, the owners of this data, and documentation about various fields.