Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory computation capabilities make it a good choice for iterative algorithms in machine learning and graph computations. Spark is also compatible with Azure Blob storage (WASB), so existing data stored in Azure can easily be processed with Spark. When you create a Spark cluster in HDInsight, you create Azure compute resources with Spark installed and configured.
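As a rough illustration of the Blob storage compatibility and in-memory reuse described above, here is a minimal Python sketch. The helper names wasb_uri and load_and_cache, and the container, account, and file names, are hypothetical and used only for illustration; the sketch assumes a SparkSession (Spark 2.0 or later) is already available on the cluster and is passed in by the caller.

```python
def wasb_uri(container: str, account: str, path: str, secure: bool = True) -> str:
    """Build a WASB URI for a blob path in an Azure Storage account.

    Follows the wasb[s]://<container>@<account>.blob.core.windows.net/<path>
    scheme; all argument values are placeholders supplied by the caller.
    """
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def load_and_cache(spark, uri: str):
    """Read a CSV file from Blob storage and mark it for in-memory caching.

    `spark` is assumed to be an existing SparkSession. Once cached, the
    DataFrame can be reused across the passes of an iterative algorithm
    without being re-read from storage each time.
    """
    df = spark.read.csv(uri, header=True, inferSchema=True)
    return df.cache()


if __name__ == "__main__":
    # Hypothetical container, account, and file names.
    print(wasb_uri("mycontainer", "myaccount", "/data/trips.csv"))
```

On a cluster, the returned URI would be handed to load_and_cache together with the session, and the cached DataFrame reused across training iterations.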
This post is authored by Debraj GuhaThakurta, Senior Data Scientist, and Brad Severtson, Senior Content Developer, at Microsoft. The data scientists among you will have seen that Spark 2.0, which was released in July 2016, offered several enhancements over Spark 1.6. Microsoft Azure released Spark 2.0 on HDInsight (Linux) as a service in September 2016. To help users get a jump start on using Spark 2.0 on HDInsight for data science and machine learning, we are providing end-to-end data science walkthroughs using Spark 2.0 on HDInsight. This is an update to the Spark 1.6-based walkthrough that we published in June 2016 as part of the Team Data Science Process documentation.
Today, we announced the general availability of R Server for Azure HDInsight. This gives Azure HDInsight the most comprehensive set of ML algorithms and statistical functions in the cloud that also leverages Hadoop and Spark. R is one of the most popular programming languages, helping millions of data scientists solve their most challenging problems in fields ranging from computational biology to quantitative marketing. R Server for Azure HDInsight is a scale-out implementation of R integrated with Spark clusters created from HDInsight. This gives you the familiarity of the R language for machine learning while leveraging the scalability and reliability built into Spark.
When Microsoft first dipped its toes into the Hadoop waters, it worked with Hortonworks to port Hadoop to Windows and run it in the Azure cloud. But running Hortonworks Data Platform (HDP) for Windows meant HDInsight (as Hadoop on Azure was eventually branded) was always a step behind the more mainstream Linux distributions, and constantly playing catch-up. When Microsoft decided to offer HDInsight clusters running on Linux, everything changed. Support from across the industry materialized, and the newest Hadoop features reached the service much more quickly. Still, HDInsight has been due for a polishing, and today Microsoft is announcing just that.
Household names such as Adobe, Jet, ASOS, Schneider Electric, and Milliman are among the hundreds of enterprises that power their big data analytics with Azure HDInsight. Azure HDInsight launched nearly six years ago and has since become the best place to run Apache Hadoop and Spark analytics on Azure. We monitor the cluster and all its services, detect and repair common issues, and respond to issues 24/7. Your big data applications can run more reliably because the HDInsight service monitors cluster health and automatically recovers from failures. Isolate your HDInsight cluster within VNETs and take advantage of transparent data encryption.