Azure Data Factory, one of Azure's growing set of data platform services, provides the ability to execute and orchestrate big data workflows. It lets you produce and manage information by offering an easy way to create, orchestrate, and monitor data pipelines. Activities are combined into pipelines, and these pipelines move and transform structured, semi-structured, and unstructured data between data sources inside and outside the Azure ecosystem; the entire process is known as a "factory". For instance, you can connect to your on-premises SQL Server, Azure SQL Database, Azure Tables, or Azure Blob storage and create data pipelines that process that data using a variety of Hadoop tools, such as Hive and Pig scripting, or custom C# processing.
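To make the pipeline concept concrete, here is a minimal sketch of an ADF pipeline definition, expressed as a Python dict mirroring the JSON you would author in the service: a copy activity that moves data from a SQL source to blob storage, followed by an HDInsight Hive activity that transforms it. The dataset names, linked references, and script path are hypothetical placeholders, not part of the original post.

```python
import json

# Illustrative sketch only: dataset names and the Hive script path are
# hypothetical. The structure mirrors the JSON pipeline format ADF uses.
pipeline = {
    "name": "CopyAndTransformPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromSqlToBlob",
                "type": "Copy",  # moves data between the two datasets
                "inputs": [{"referenceName": "OnPremSqlDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "BlobDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "SqlSource"},
                    "sink": {"type": "BlobSink"},
                },
            },
            {
                "name": "TransformWithHive",
                "type": "HDInsightHive",  # runs a Hive script on an HDInsight cluster
                "dependsOn": [
                    # Chain the activities: Hive runs only after the copy succeeds
                    {"activity": "CopyFromSqlToBlob", "dependencyConditions": ["Succeeded"]}
                ],
                "typeProperties": {"scriptPath": "scripts/transform.hql"},
            },
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

The `dependsOn` entry is what turns two independent activities into an ordered pipeline: the Hive transformation waits for the copy to finish before it starts.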
This post is by Robert Alexander, a Solution Architect in the Data Group at Microsoft. Big data is big and getting bigger, so get used to a whole new set of prefixes beyond peta: exa, zetta, and yotta. The sheer size of this data presents both immense opportunities and real challenges to businesses. Access to mountains of data from the Internet of Things, from applications in the cloud, and from mobile devices will enable businesses to make faster and better decisions.
Beginning in 2016, Microsoft rolled out a preview of Microsoft R Server (MRS) for Azure HDInsight clusters. Recent blog posts (by Max Kaznady and David Smith) have highlighted how to use and tune this service for large-scale machine learning tasks. In this post, we push the envelope and show how to build an end-to-end, fully operationalized analytics pipeline using Azure Data Factory (ADF) and MRS on HDInsight (specifically, Apache Spark). By integrating Azure Data Factory with Microsoft R Server and Spark, we show how to configure a scalable training and testing pipeline that operates on large volumes of data.
Gaurav Malhotra joins Scott Hanselman to show how you can run your Azure Machine Learning (AML) service pipelines as a step in your Azure Data Factory (ADF) pipelines. This enables you to run your machine learning models with data from multiple sources (85 data connectors supported in ADF). This seamless integration enables batch prediction scenarios such as identifying possible loan defaults, determining sentiment, and analyzing customer behavior patterns.
Gartner has released its 2020 Data Science and Machine Learning Platforms Magic Quadrant, and we are excited to announce that Databricks has been recognized as a Leader. Gartner evaluated 17 vendors on their completeness of vision and ability to execute. We believe the following attributes contributed to this recognition: the biggest advantage of Databricks' Unified Data Analytics Platform is its ability to run data processing and machine learning workloads at scale, all in one place. Customers praise Databricks for significantly reducing total cost of ownership (TCO) and accelerating time to value, thanks to its seamless end-to-end integration of everything from ETL to exploratory data science to production machine learning. With Databricks, data teams can build reliable data pipelines with Delta Lake, which adds reliability and performance to existing data lakes.