This article was originally posted here, by Mubashir Qasim. In my last article, I stated that for practitioners (as opposed to theorists), the real prerequisite for machine learning is data analysis, not math. One of the main reasons for making this statement is that data scientists spend an inordinate amount of time on data analysis. The traditional claim is that data scientists "spend 80% of their time on data preparation." While I think that claim is essentially correct, a more precise statement is that you'll spend 80% of your time getting data, cleaning data, aggregating data, reshaping data, and exploring data through exploratory data analysis and data visualization.
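The tasks listed above can be sketched in a few lines of pandas. The data set and column names below are hypothetical, chosen only to show one step of each kind: cleaning, aggregating, reshaping, and exploring.

```python
import numpy as np
import pandas as pd

# Illustrative sales data with the kinds of issues cleaning addresses
df = pd.DataFrame({
    "region": ["East", "East", "West", "West", None],
    "month": ["Jan", "Feb", "Jan", "Feb", "Jan"],
    "sales": [100.0, np.nan, 250.0, 300.0, 50.0],
})

# Cleaning: drop rows missing a key field, fill missing measures
clean = df.dropna(subset=["region"]).fillna({"sales": 0.0})

# Aggregating: total sales per region
totals = clean.groupby("region", as_index=False)["sales"].sum()

# Reshaping: pivot long format into a region-by-month table
wide = clean.pivot(index="region", columns="month", values="sales")

# Exploring: quick summary statistics
summary = clean["sales"].describe()

print(totals)
print(wide)
print(summary)
```

Each step here is one line, but on real data each one typically expands into many decisions (which rows to drop, how to impute, which keys to group by), which is where the 80% goes.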
As the director of Datamine Decision Support Systems, I've delivered more than 80 data-intensive projects -- including data warehousing, data integration, business intelligence, content performance and predictive models -- across several industries and high-profile corporations. In most cases, data quality proved to be a critical success factor. The obvious challenge in every case was to effectively query heterogeneous data sources, then extract and transform the data into one or more data models. The non-obvious challenge was the early identification of data issues, which in most cases were unknown even to the data owners. There are many aspects to data quality, including consistency, integrity, accuracy, and completeness.
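The four quality dimensions named above can each be probed with a simple automated check. The following is a minimal sketch, assuming a hypothetical customer extract; the column names, thresholds, and the `quality_report` helper are illustrative, not a standard API.

```python
import pandas as pd

# Hypothetical customer extract with typical quality problems
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "not-an-email"],
    "age": [34, 29, 29, -5],
})

def quality_report(frame: pd.DataFrame) -> dict:
    return {
        # Completeness: share of non-null values per column
        "completeness": frame.notna().mean().to_dict(),
        # Integrity: duplicated keys violate uniqueness of customer_id
        "duplicate_keys": int(frame["customer_id"].duplicated().sum()),
        # Accuracy: values outside a plausible range
        "invalid_ages": int((~frame["age"].between(0, 120)).sum()),
        # Consistency: emails failing a crude format check
        "bad_emails": int((~frame["email"].str.contains("@", na=True)).sum()),
    }

report = quality_report(df)
print(report)
```

Running checks like these early in a project is exactly the "early identification of data issues" described above: the report surfaces problems before they propagate into downstream models.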
Databricks, the inventor and commercial distributor of the Apache Spark processing platform, has announced a system called Delta, which it believes will appeal to CIOs as a data lake, a data warehouse and a "streaming ingest system". It is said to eliminate the need for extract, transform and load (ETL) processes. Yes, there is a lot of hype, but there is real worth in AI and machine learning. Read our advice on how to avoid adopting a "black box" approach.
With its growing emphasis on all things AI -- coupled with its history as a tool vendor -- it's not surprising that Microsoft is working on tools not just for traditional programmers, but also for data scientists. According to a Microsoft Research presentation from earlier this year, data scientists currently spend 80 percent of their time extracting and cleaning data -- a.k.a. "data wrangling." Microsoft wants to fix this. A year ago, I first heard from a contact of mine about a new machine-learning-related tool under development by Microsoft, codenamed "Pendleton." But it wasn't until The Walking Cat (@h0x0d on Twitter) unearthed some more information and documents that I had enough to write about Pendleton.
Data scientists, data analysts, business analysts, owners of data-driven companies -- what do they have in common? They all need to be sure that the data they'll be consuming is in its optimal state. Right now, with the emergence of Big Data, Machine Learning, Deep Learning and Artificial Intelligence (the New Era, as I call it), almost every company or entrepreneur wants to create a solution that uses data to predict or analyze. Until now, there has been no solution to a problem common to all data-driven projects of the New Era: data cleansing and exploration. With Optimus we are launching an easy-to-use, easy-to-deploy-to-production, open-source framework to clean and analyze data in a parallel fashion using state-of-the-art technologies.
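To make the "clean in a parallel fashion" idea concrete, here is a generic illustration -- not the Optimus API -- of splitting a messy data set into chunks and cleaning them concurrently. The data, the `clean_chunk` helper, and the chunk size are all assumptions for the sake of the example.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Toy records with messy values; a real data set would be far larger
raw = pd.DataFrame({
    "name": ["  Alice ", "BOB", None, "carol"],
    "score": ["10", "n/a", "30", "40"],
})

def clean_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    out = chunk.copy()
    # Normalize whitespace and casing in names
    out["name"] = out["name"].str.strip().str.title()
    # Coerce scores to numbers, turning junk like "n/a" into NaN
    out["score"] = pd.to_numeric(out["score"], errors="coerce")
    # Drop rows that are unusable after cleaning
    return out.dropna()

# Split the frame into chunks and clean them concurrently
chunks = [raw.iloc[i:i + 2] for i in range(0, len(raw), 2)]
with ThreadPoolExecutor(max_workers=2) as pool:
    cleaned = pd.concat(pool.map(clean_chunk, chunks), ignore_index=True)

print(cleaned)
```

A framework like Optimus packages this kind of partition-and-clean pattern behind a higher-level interface and scales it out with Spark rather than local threads.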