Data Quality


Machine Learning with Scala on Spark by Jose Quesada

#artificialintelligence

This video was recorded at Scala Days Berlin 2016 follow us on Twitter @ScalaDays or visit our website for more information http://scaladays.org Abstract: What new superpowers does it give me? The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with Spark.ml, and compare it to other widely used frameworks (in R and python) along several dimensions: ease of use, productivity, feature set, and performance.
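The stages the abstract names, exploration, data cleaning, feature engineering, and model fitting, can be sketched end to end. This is a toy stand-in in plain Python rather than Spark.ml, and the dataset and column names are invented for illustration:

```python
# Toy walk-through of the pipeline stages named in the abstract:
# exploration, cleaning, feature engineering, model fitting.
# Plain Python stands in for Spark.ml; the data is invented.

raw = [
    {"sqm": 50,   "rooms": 2, "price": 200_000},
    {"sqm": 80,   "rooms": 3, "price": 320_000},
    {"sqm": None, "rooms": 1, "price": 90_000},   # dirty record
    {"sqm": 120,  "rooms": 4, "price": 470_000},
]

# Exploration: a basic summary of the target column.
prices = [r["price"] for r in raw]
print("n =", len(raw), "mean price =", sum(prices) / len(prices))

# Cleaning: drop records with missing predictors.
clean = [r for r in raw if r["sqm"] is not None]

# Feature engineering: add a derived feature.
for r in clean:
    r["sqm_per_room"] = r["sqm"] / r["rooms"]

# Model fitting: one-variable least squares, price ~ sqm.
n = len(clean)
mx = sum(r["sqm"] for r in clean) / n
my = sum(r["price"] for r in clean) / n
slope = sum((r["sqm"] - mx) * (r["price"] - my) for r in clean) / \
        sum((r["sqm"] - mx) ** 2 for r in clean)
intercept = my - slope * mx
print("price ~ %.0f + %.0f * sqm" % (intercept, slope))
```

In Spark.ml the same stages would be expressed as DataFrame transformations and a `Pipeline` of transformers and an estimator, which is what makes the framework comparison in the talk possible.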


Machine Learning and Data Quality

#artificialintelligence

Classic examples are television, where data about the programmes you watch or show interest in lets Machine Learning software identify other shows you might like, and Facebook, whose Machine Learning system works out which news items appear on your timeline based on your activity and comments on the site. A fundamental truth of Machine Learning is that even the best-designed algorithms are only as good as the data they work with. Ambitious programmers who feed their Machine Learning programmes large quantities of Big Data are bound to be disappointed if the system appears to have learnt nothing, or produces models that simply don't work, when the real cause is poor-quality data: dirty and full of corruptions, mismatches, duplications and other inaccuracies. Spotless Data's web-based API solution to dirty data can be built into your Machine Learning software at the design or build stage, or you can simply pass your data through the API before loading it into the data lake or warehouse where the Machine Learning software will work with it, producing the models and algorithms that will let your company stand out among its competitors and attract the lion's share of potential customers.
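The kinds of problems listed, duplicates, mismatches, and corrupt values, can be caught with simple validation rules before data reaches the lake or warehouse. The sketch below is illustrative only and does not show Spotless Data's actual API; the field names and rules are assumptions:

```python
# Illustrative pre-load cleaning pass: reject corrupt or mismatched
# records and duplicates before they reach the data lake.
# (Not Spotless Data's API; fields and rules are invented.)

records = [
    {"id": 1, "email": "ana@example.com", "country": "DE"},
    {"id": 2, "email": "ANA@example.com", "country": "DE"},  # duplicate after normalisation
    {"id": 3, "email": "bob@example",     "country": "XX"},  # corrupt email, unknown country
]

VALID_COUNTRIES = {"DE", "FR", "ES"}

def normalise(rec):
    rec = dict(rec)
    rec["email"] = rec["email"].strip().lower()
    return rec

seen, clean, rejected = set(), [], []
for rec in map(normalise, records):
    ok = ("@" in rec["email"]
          and "." in rec["email"].split("@")[-1]
          and rec["country"] in VALID_COUNTRIES)
    if not ok:
        rejected.append(rec)      # corrupt or mismatched value
    elif rec["email"] in seen:
        rejected.append(rec)      # duplicate
    else:
        seen.add(rec["email"])
        clean.append(rec)

print(len(clean), "clean,", len(rejected), "rejected")
```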


5 Free Data Science eBooks For Your Summer Reading List

@machinelearnbot

You will need a basic understanding of statistical concepts and R programming, and the book is intended for practising Data Scientists, but as long as you tick those boxes you should be fine. The book is offered on a Pay-What-You-Want model, including free, and helpfully it is also available as a tablet-friendly PDF, also free. Instead of explaining the mathematics and theory and then showing examples, the authors start with a practical data-related life-science challenge. There is also a free Microsoft Excel Practical Data Cleaning template to help you get a good start with your data.


Data Cleansing Tools in Azure Machine Learning

#artificialintelligence

Today, we'll discuss the impact of data cleansing in a Machine Learning model and how it can be achieved in Azure Machine Learning (Azure ML) Studio. After running the experiment and creating the scatter plot again (using the clipped amount), the outliers have been removed and the plot looks as follows. To treat null values, the Clean Missing Data module can be used. The SMOTE module, meanwhile, returns a data set that contains the original samples plus an additional number of synthetic minority samples, depending on the percentage you specify.
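The two cleansing operations the summary mentions, clipping outliers and treating nulls, are easy to show in miniature. Plain Python stands in here for the Azure ML Studio modules, and the threshold and data are assumptions:

```python
# Sketch of the two operations described: clipping outlier values
# (Azure ML's clipping of the amount column) and treating nulls
# (the Clean Missing Data module). Data and threshold are invented.

amounts = [12.0, 15.5, None, 14.0, 900.0, 13.5]  # None = missing, 900.0 = outlier

# Clip: cap values above an upper threshold.
UPPER = 100.0
clipped = [min(a, UPPER) if a is not None else None for a in amounts]

# Clean missing: replace nulls with the mean of the observed values.
observed = [a for a in clipped if a is not None]
mean = sum(observed) / len(observed)
cleaned = [a if a is not None else mean for a in clipped]

print(cleaned)
```

In Azure ML Studio the same effect comes from wiring the equivalent modules into the experiment graph rather than writing code, which is the workflow the article walks through.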


No, Kaggle is unsuitable to study AI & ML. A reply to Ben Hamner

@machinelearnbot

In a recent Quora session, Kaggle CTO Ben Hamner outlined his advice on studying machine learning. At the end of the post, I suggest an alternative platform, Startcrowd, for building real-world AI products instead of statistical models. If you are starting from scratch, with no coding skills or data science experience, I personally recommend the Python course on Codecademy, the Andrew Ng ML course on Coursera, the Intro to Data Science on Udacity, and the Stanford courses on Convolutional Neural Networks and NLP. If performance really is the issue, then you can follow Ben's third piece of advice: acquire more data, improve data cleaning, or optimize the model like a Kaggle player.


AI will hurt banking without a ground-up approach

#artificialintelligence

So, unless banks completely rethink their customer-interfacing model, nimble players who can build clever AI applications on customers' financial information will reduce banks to mere utility providers, hurting their margins. Banks have two challenges to resolve at the same time: internal data quality issues, to make AI work for operational intelligence, and external data ownership issues, to make AI work for dealing with customers. Barclays, with access to mortgage transactions from Halifax, could create a home-renovation credit product for the customer. While this sounds good from the banks' perspective, the more likely outcome is that fintech firms, and possibly tech giants (Google, IBM, Microsoft, Amazon), will make clever use of customer transaction data, as they are light years ahead of banks in their AI capabilities.


20/20 View of 2020: 10 Trends for the Digital Future » Brave New Coin

#artificialintelligence

The recent numbers published by the largest clearing houses support this migration mainly due to crippling regulatory burdens of keeping larger amounts of capital for OTC trades. If we do not fully automate that part of the cleared trade process, we will not achieve the full compliance needed to create the efficiencies predicted by many. For example, we see the use of AI in our platform as a means to achieve two major goals: data quality, and trade breaks reconciliation and remediation. This example shows that the holy grail in terms of our platform's goals is to improve data quality, improve data matching of non-structured data, and help our AI algos correct data impurities on their own.
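Trade-break reconciliation of the kind described, pairing records from two sides whose fields almost but not exactly agree, can be illustrated with simple fuzzy string matching. Here `difflib` stands in for the platform's AI matching; the record formats and the 0.7 threshold are assumptions:

```python
# Illustrative trade-break reconciliation via fuzzy matching: each of
# our records is paired with the counterparty record it most resembles,
# or flagged as a break if nothing scores above a threshold.
# (difflib stands in for the platform's AI; data is invented.)
import difflib

ours   = ["IRS USD 10Y 1.5MM CPTY ACME", "FX FWD EURUSD 2MM CPTY GLOBEX"]
theirs = ["IRS USD 10YR 1.5MM CPTY ACME CORP", "EQ SWAP 500K CPTY NOWHERE"]

def best_match(rec, candidates, threshold=0.7):
    scored = [(difflib.SequenceMatcher(None, rec, c).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None  # None = trade break

breaks = [(rec, best_match(rec, theirs)) for rec in ours]
for rec, match in breaks:
    print(rec, "->", match or "BREAK: needs remediation")
```

The unmatched record is exactly the break that needs remediation; the automation goal described above is to have the system resolve as many of these pairings and corrections as possible on its own.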


Using Machine Learning for Data Quality Matching - Talend

#artificialintelligence

In my last blog, I highlighted some of the Data Governance challenges in Big Data and how Data Quality (DQ) is a big part of Data Governance. Big Data has made Machine Learning (ML) mainstream, and just as DQ has impacted ML, ML is also changing how DQ is implemented. ML is becoming mainstream because Big Data processing engines such as Spark have made it practical for developers to apply ML libraries to their data at scale. In summary, by combining the power of ML with Spark and data quality processes, this workflow can be used to predict matches for data sets automatically.
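The match-prediction workflow summarised above can be sketched in miniature: turn candidate record pairs into similarity features, learn a decision rule from labelled pairs, then predict matches on new pairs. This is a toy stand-in, not Talend's implementation; a real pipeline would train a proper classifier on Spark, and the fields and data here are invented:

```python
# Toy ML-for-matching sketch: similarity features from record pairs,
# a decision threshold "learned" from labelled examples, prediction
# on unseen pairs. (Not Talend's implementation; data is invented.)
import difflib

def similarity(a, b):
    """Single feature: average string similarity of name and city."""
    def sim(x, y):
        return difflib.SequenceMatcher(None, x, y).ratio()
    return (sim(a["name"], b["name"]) + sim(a["city"], b["city"])) / 2

# Labelled training pairs: (record, record, is_match).
train = [
    ({"name": "Jon Smith", "city": "Berlin"},
     {"name": "John Smith", "city": "Berlin"}, True),
    ({"name": "Ana Lopez", "city": "Madrid"},
     {"name": "Anna Lopez", "city": "Madrid"}, True),
    ({"name": "Bob Jones", "city": "London"},
     {"name": "Rita Patel", "city": "Mumbai"}, False),
]

# "Learning": place the threshold midway between the two classes.
match_scores    = [similarity(a, b) for a, b, y in train if y]
nonmatch_scores = [similarity(a, b) for a, b, y in train if not y]
threshold = (min(match_scores) + max(nonmatch_scores)) / 2

def predict(a, b):
    return similarity(a, b) >= threshold

print(predict({"name": "Jon Smyth", "city": "Berlin"},
              {"name": "John Smith", "city": "Berlin"}))
```

Replacing the hand-placed threshold with a classifier trained on many labelled pairs, and running the feature extraction on Spark, is essentially the scaled-up version of this workflow.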