Apache Hadoop YARN is a modern resource-management platform that handles resource scheduling, isolation and multi-tenancy for a variety of data processing engines that can co-exist and share a single data center in a cost-effective manner. In the first half of the talk, we will take a brief look at some of the big efforts under way in the Apache Hadoop YARN community. We will then dig deeper into one of those efforts: supporting the Docker runtime in YARN. Docker is an application container engine that enables developers and sysadmins to build, deploy and run containerized applications. In the second half, we'll discuss container runtimes in YARN, with a focus on using the DockerContainerRuntime to run various Docker applications under YARN.
In some ways Hortonworks is old-fashioned in that it still clings to the stretch goal of managing half of the world's data in an era where cloud object stores and bespoke analytic services are adding more alternatives to the mix. Hortonworks' aspirational goal may not be realistic, but never mind, there are bigger fish to fry. The underlying message from this year's North American DataWorks Summit and analyst briefings is that the company is competing while facing the challenges of navigating a multipolar cloud world. My Big on Data bro Andrew Brust reported the headlines that came out earlier in the week: Hortonworks is releasing the 3.0 version of its data platform that, confusingly, is based on Hadoop 3.1. As we reported back at the start of the year, the 3.x generation of Apache Hadoop will mark a watershed with containerization and storage.
The fall Strata conference is when Big Data makes it to Broadway. And the week was very much a blur. We used to come away from Strata with the memory of one or two overriding themes: last year it was machine learning and the new infatuation with Spark; before that, it was about Hadoop opening up the opportunity for exploratory analytics and about Hadoop disappearing behind a veneer of familiar SQL. It's easy to get excited by the idealism around the shiny new thing. But let's set something straight: Spark ain't going to replace Hadoop.
Hadoop Summit San Jose has come to an end. This year, I was there to cover the news, and to present a breakout session. My talk focused on fragmentation in the industry: the Big Data ecosystem has too many vendors, too many Hadoop distributions, too many execution engines, too many Apache projects. The result is an overly complex market for products and technologies that makes it really difficult for customers to make purchasing decisions. Maybe getting my talk ready biased the way I analyzed the news but, interestingly, it sure seemed like the issues I addressed in my session were on the minds of some of the news-making exhibitors at the conference.