Apache Spark continues to forge ahead with help from IBM's Spark Technology Center.


Apache Spark puts both deep and broad advanced analytics capabilities in the hands of the masses. Whether you are a data scientist, data engineer, analytics app developer or citizen analyst, Spark delivers sophisticated analytics more simply, faster and more efficiently than ever before. Spark is currently one of the most active open source projects for big data. The latest release, Spark 2.0, is the result of nearly 2,500 contributions, with consistently more than 100 contributors per month. The new release is a significant milestone and builds upon the input of the rapidly growing user and developer community.

Spark 2.0: more performance, more statistical models


Apache Spark, the open-source cluster computing framework, will soon see a major update with the upcoming release of Spark 2.0. This update promises to be faster than Spark 1.6, thanks to a run-time compiler that generates optimized bytecode. It also promises to be easier for developers to use, with streamlined APIs and a more complete SQL implementation. Spark 2.0 will also include a new "structured streaming" API, which will allow developers to write algorithms for streaming data without having to worry about the fact that streaming data is always incomplete; algorithms written for complete DataFrame objects will work for streams as well. This update also includes some news for R users.
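
To make the "structured streaming" idea a little more concrete, here is a minimal conceptual sketch in plain Python. It is not Spark's API: the names count_by_key and streaming_counts and the sample data are hypothetical, chosen only for illustration. It shows logic written for a complete dataset being reused unchanged on data that arrives in small batches, with a running result that is incomplete at first and matches the batch answer once all the data has arrived.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def count_by_key(rows: Iterable[str]) -> Counter:
    """Batch-style logic: count occurrences of each value in a complete dataset."""
    return Counter(rows)


def streaming_counts(batches: Iterable[List[str]]) -> Iterator[Dict[str, int]]:
    """Reuse the batch logic on each arriving micro-batch (hypothetical helper).

    A running state is kept and the current, possibly incomplete,
    result is emitted after every batch.
    """
    state: Counter = Counter()
    for batch in batches:
        state.update(count_by_key(batch))  # same logic as the batch case
        yield dict(state)


# Illustrative data only: the same records, seen all at once or in three pieces.
complete = ["spark", "sql", "spark", "r", "sql", "spark"]
batches = [complete[:2], complete[2:4], complete[4:]]

batch_result = dict(count_by_key(complete))
stream_results = list(streaming_counts(batches))

print(batch_result)
for partial in stream_results:
    print(partial)
```

The design point is that the counting logic is written once. Only the driver around it changes, which is the kind of convenience the structured streaming API is described as offering for DataFrame code.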

As a reminder, a one-page summary of all the courses, books & videos I've reviewed in the past year can be found on my Journey Roadmap page. The highlight of May was attending ODSC East (Open Data Science Conference) in Boston. I wrote extensively about it in my previous blog post, so I won't repeat anything about it here. May provided a few more steps forward in my Data Science knowledge. I have also spent a good portion of June emptying my house in preparation for a sale, for which I am now waiting.