The second major version of Azure Data Factory, Microsoft's cloud service for ETL (Extract, Transform and Load), data prep and data movement, was released to general availability (GA) about two months ago. Cloud GAs come so fast and furious these days that it's easy to be jaded. But data integration is too important to overlook, and I wanted to examine the product more closely. Roughly thirteen years after its initial release, SQL Server Integration Services (SSIS) is still Microsoft's on-premises state of the art in ETL. It's old, and it's got tranches of incremental improvements in it that sometimes feel like layers of paint in a rental apartment.
Data scientists, data analysts, business analysts, owners of data-driven companies: what do they have in common? They all need to be sure that the data they consume is in the best possible shape. Right now, with the emergence of Big Data, Machine Learning, Deep Learning and Artificial Intelligence (the New Era, as I call it), almost every company or entrepreneur wants to create a solution that uses data to predict or analyze. Until now there was no solution to the problem common to all data-driven projects in the New Era: data cleansing and exploration. With Optimus we are launching a framework that is easy to use, easy to deploy to production, and open source, for cleaning and analyzing data in parallel using state-of-the-art technologies.
Outdated, inaccurate, or duplicated data won't drive good data-driven solutions. When data is inaccurate, leads are harder to track and nurture, and insights may be flawed. The data on which you base your big data strategy must be accurate, up to date, and as complete as possible, and it should not contain duplicate entries. Cleaning data is the most time-consuming and least enjoyable task in data science (until Optimus), but it is also one of the most important. No one can start a data science, machine learning or data-driven project without being sure that the data they consume is in the best possible shape.
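To make those three requirements concrete (complete, up to date, no duplicates), here is a minimal sketch in plain Python. It is not the Optimus API; the function `clean_records`, the field names (`email`, `name`, `updated`) and the sample leads are all hypothetical, chosen only to illustrate the checks described above.

```python
from datetime import date


def clean_records(records, key_fields, required_fields, date_field, cutoff):
    """Keep records that are complete, recent, and unique on key_fields.

    Hypothetical helper for illustration; not part of any library.
    """
    # Key fields and the date field must be present too, so treat them as required.
    must_have = set(required_fields) | set(key_fields) | {date_field}
    seen = set()
    cleaned = []
    for rec in records:
        # Completeness: skip rows with a missing or empty required value.
        if any(rec.get(field) in (None, "") for field in must_have):
            continue
        # Freshness: skip rows last updated before the cutoff date.
        if rec[date_field] < cutoff:
            continue
        # Duplicates: normalize the key so " ANA@x.com" matches "ana@x.com".
        key = tuple(str(rec[field]).strip().lower() for field in key_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


# Hypothetical sample data: one duplicate, one incomplete row, one stale row.
leads = [
    {"email": "ana@example.com", "name": "Ana", "updated": date(2023, 5, 1)},
    {"email": " ANA@example.com", "name": "Ana P.", "updated": date(2023, 6, 1)},
    {"email": "bo@example.com", "name": "", "updated": date(2023, 4, 1)},
    {"email": "cy@example.com", "name": "Cy", "updated": date(2019, 1, 1)},
    {"email": "di@example.com", "name": "Di", "updated": date(2023, 2, 1)},
]

result = clean_records(
    leads,
    key_fields=["email"],
    required_fields=["name"],
    date_field="updated",
    cutoff=date(2022, 1, 1),
)
```

This version keeps the first occurrence of each duplicate and drops the rest. A real pipeline might instead prefer the most recently updated copy, and on large datasets the same checks would run in a distributed engine rather than a single Python loop.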
"Data wrangling" was an interesting phrase to hear in the machine learning (ML) presentations at Microsoft Ignite. Interesting because data wrangling comes from business intelligence (BI), not from artificial intelligence (AI). Microsoft understands that ML incorporates concepts from both disciplines. Further discussions raised another key point: Microsoft understands that business-to-business (B2B) is just as fertile ground for ML as business-to-consumer (B2C). The ML applications that get the most press are voice, augmented reality and autonomous vehicles.
Qualified data providers include category-leading brands such as Reuters, who curate data from over 2.2 million unique news stories per year in multiple languages; Change Healthcare, who process and anonymize more than 14 billion healthcare transactions and $1 trillion in claims annually; Dun & Bradstreet, who maintain a database of more than 330 million global business records; and Foursquare, whose location data is derived from 220 million unique consumers and includes more than 60 million global commercial venues. For qualified data providers, AWS Data Exchange makes it easy to reach the millions of AWS customers migrating to the cloud by removing the need to build and maintain infrastructure for data storage, delivery, billing, and entitling. Enterprises, scientific researchers, and academic institutions have been using third-party data for decades to conduct research, power applications and analytics, train machine-learning models, and make data-driven decisions. But, as these customers subscribe to more third-party data, they often have to wait weeks to receive shipped physical media, manage sensitive credentials for multiple File Transfer Protocol (FTP) hosts and periodically check for updates, or code to several disparate application programming interfaces (APIs). These methods are inconsistent with the modern architectures customers are developing in the cloud.