Change Data Capture


Top 10 Essentials for Modern Data Integration - DATAVERSITY

#artificialintelligence

Data integration challenges are becoming more difficult as the volume of data available to large organizations continues to grow. Business leaders clearly understand that their data is of critical value, but the volume, velocity, and variety of data available today are daunting. Faced with these challenges, companies are looking for a scalable, high-performing data integration approach to support a modern data architecture. The problem is that just as data integration grows more complex, the number of potential solutions is seemingly endless. From DIY products built by an army of developers to out-of-the-box solutions covering one or more use cases, it is difficult to navigate the myriad choices and the decision tree that follows.


Google's Data Cloud Summit Serves Up A Breadth Of New Capabilities

#artificialintelligence

I mentioned in my preview of the Google Data Cloud Summit this week that I was expecting some exciting technology announcements for AI, machine learning, data management, and analytics. Google did not disappoint in that department. Google Cloud's stated mission is to accelerate every organization's ability to transform through data-powered innovation, a theme that should come as no surprise to anyone, but one that in Google's case is backed by a slew of new technologies and innovations. In this article, I will unpack and analyze some of the announcements from this busy week.


Change Data Capture (CDC) and Kafka

@machinelearnbot

Change Data Capture (CDC) is an approach to data integration based on identifying, capturing, and delivering the changes made to data sources, typically relational databases. A change operation can be the INSERT of a new record, or an UPDATE or DELETE of an existing record. With Apache Kafka, and in particular the Kafka Connect API and the available Kafka Connect source connectors, it is easy to create a data pipeline that captures changes from an existing RDBMS and delivers them to a Kafka cluster. From there you can send those changes to downstream systems, typically NoSQL stores (such as Cassandra, MongoDB, or Couchbase) or search engines (such as Elasticsearch). It is also possible, and advisable, to keep the changes stored or cached in a Kafka compacted topic; that way, if you want to perform joins via Kafka Streams or KSQL, they can be done easily and efficiently in parallel, with no repartitioning necessary.
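To make the source-connector idea concrete, here is a minimal sketch of a configuration for Debezium, one widely used Kafka Connect CDC source connector for relational databases. All hostnames, credentials, and database/table names below are placeholders chosen for illustration, and the exact property names can vary between Debezium versions, so consult the documentation for the version you deploy:

```json
{
  "name": "inventory-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.com",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "<secret>",
    "database.server.id": "184054",
    "database.server.name": "inventory",
    "table.include.list": "inventory.customers"
  }
}
```

Submitted to the Kafka Connect REST API (an HTTP POST of this JSON to the `/connectors` endpoint), a connector like this reads the database's transaction log and emits each INSERT, UPDATE, and DELETE as an event on a Kafka topic named after the server and table. Compaction is configured on the topic itself (via the `cleanup.policy=compact` topic setting) rather than in the connector, which is what lets Kafka retain the latest state per key for the Streams/KSQL joins described above.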