You probably did not hear it here first. Spark has been making waves in big data for a while now, and 2017 has not disappointed anyone who has bet on its meteoric rise. That was a pretty safe bet, actually, as interpreting market signals, speaking with pundits and monitoring data all pointed in the same direction.
At Google's Cloud Next event in San Francisco, partners are announcing third-party data services on Google Cloud Platform (GCP) that offer impressive integration with Google's own offerings. Confluent, the company founded by the creators of Apache Kafka, is one such partner. Fivetran, which delivers data from a variety of databases and Software as a Service (SaaS) applications into major data warehouse platforms, is another. For Confluent, the story is pretty straightforward: Confluent Cloud, the company's managed service based on Apache Kafka, had already been available for almost a year on GCP. But now, rather than simply running on GCP infrastructure, Confluent Cloud on GCP will integrate tightly with the first-party experience.
Confluent, the company whose founders created Apache Kafka, confidentially filed for an IPO late yesterday. In effect, this filing was a formal declaration of intent to IPO, with key details such as the number of shares and proposed price range still in flux. Valued at over $4 billion, Confluent, like Databricks, could be considered a super-unicorn, and given the inflating valuations, the question wasn't whether, but when, Confluent would finally file for a public offering. We're still asking the same thing about Databricks, whose funding has topped $1 billion and whose valuation is off the charts. Confluent achieved unicorn status a couple of years ago and, like fellow former unicorn (and now public) MongoDB, adopted its own quasi-open-source licensing to prevent cloud providers from monetizing its IP: not Kafka itself (which remains an Apache project), but all the enterprise goodies and connectors that the company has built around it.
Airflow is an open-source workflow management platform for data engineering pipelines. Alation focuses on data governance, analytics, and data management. Alluxio is an open-source data orchestration layer that brings data close to compute for big data and AI/ML workloads in the cloud. Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists, and engineers. Anodot detects and groups anomalies across silos to help you find and fix business incidents in real time.
The big data ecosystem was on full display at last week's Strata Hadoop World conference in San Jose. At the ripe old age of 10, Hadoop is still the driving force, but newer frameworks like Spark and Kafka are gaining steam. Here are some of the top trends your Datanami editor pulled from the show, based on observations and discussions with attendees and vendors. Let's start with the biggest news from Strata, which was the rise of Kafka and real-time streaming. As Kafka creator Jay Kreps tweeted, it seemed "like every other presentation at Strata this year was on streaming data."