Data Engineering Technologies in 2021


Airflow is an open-source workflow management platform for data engineering pipelines. Alation focuses on data governance, analytics, and data management. Alluxio is an open-source data orchestration layer that brings data close to compute for big data and AI/ML workloads in the cloud. Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists, and engineers. Anodot detects and groups anomalies across silos to help you find and fix business incidents in real time.

The Year in Machine Learning (Part Two)


This is the second installment in a three-part review of 2016 in machine learning and deep learning. Part One covered general trends. In Part Two, we review the year in open source machine learning and deep learning projects. Part Three will cover commercial machine learning and deep learning software and services. There are thousands of open source projects on the market today, and we cannot cover them all. We've selected the most relevant projects based on usage reported in surveys of data scientists, as well as development activity recorded in OpenHub. In this post, we limit the scope to projects with a non-profit governance structure, and those offered by commercial ventures that do not also provide licensed software. Part Three will include software vendors who offer open source "community" editions together with commercially licensed software.

Pentaho version 8: Here's what's new and improved


Pentaho, a product that originally launched over a decade ago as an open source business intelligence package, will soon be available in a version 8.0 release. Pentaho existed as an independent company for more than a decade, until it was acquired by Hitachi Data Systems (HDS) in 2015. HDS integrated Pentaho into its own offerings and services implementations, but otherwise left most things running as they had been before the acquisition. That changed last month, when Hitachi announced it was combining Pentaho, HDS and Hitachi Insight Group (the unit responsible for the Lumada IoT platform) into a single new division called Hitachi Vantara. While Pentaho as a distinct company has now been phased out, the Pentaho product and brand have not in any way been withdrawn.

Salesforce's PredictionIO Donated to the Apache Software Foundation


A few years ago, I started PredictionIO, an open source machine learning platform, with the mission to scale and simplify the development of machine learning technology. PredictionIO quickly grew in prominence and was even ranked on GitHub as the most popular Apache Spark-based machine learning product in the world. When Salesforce acquired PredictionIO in February, I was excited to have the amazing opportunity to continue to build our platform on a much larger scale. Today, I am thrilled to announce that Salesforce will donate the PredictionIO trademark to the Apache Software Foundation (ASF) and by unanimous vote the platform has been accepted into the ASF incubator program. This demonstrates the open source community's recognition of the importance of the PredictionIO project.

Apache Arrow unifies in-memory Big Data systems


In-memory data systems have had a panache for several years now. From SAP HANA to Apache Spark, customers and industry watchers have been continually intrigued by systems that can operate on data directly in memory, bypassing the slowness of disks and the sequential read rubric of file systems. Whether or not in-memory is always the best way to go, it's usually a crowd-pleaser. In fact, most modern BI systems use their own in-memory engines that also store the data in a column-wise fashion. Doing so allows for high rates of compression, because data for a given column across rows will often be the same, or very close, in value.
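
To make the compression point concrete, here is a minimal sketch in plain Python. It is not how Apache Arrow or any particular BI engine encodes data. It uses simple run-length encoding as a stand-in technique, and the sample table and the helper name `run_length_encode` are hypothetical, chosen only to show why placing a column's values side by side tends to produce longer runs of repeated values.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical sample table: each row is (country, status, amount).
rows = [
    ("US", "shipped", 10),
    ("US", "shipped", 12),
    ("US", "pending", 12),
    ("DE", "pending", 12),
    ("DE", "pending", 15),
    ("DE", "shipped", 15),
]

# Row-wise layout: values from different columns are interleaved.
row_wise = [value for row in rows for value in row]

# Column-wise layout: all values of one column sit next to each other.
columns = list(zip(*rows))
column_wise = [value for column in columns for value in column]

row_runs = run_length_encode(row_wise)
column_runs = run_length_encode(column_wise)

print(len(row_wise), "values stored row-wise ->", len(row_runs), "runs")
print(len(column_wise), "values stored column-wise ->", len(column_runs), "runs")
```

With this sample data, the row-wise layout has no adjacent repeats to collapse, while the column-wise layout groups equal values together and needs fewer runs. Real columnar engines typically combine several encodings, so treat this only as an illustration of the principle.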