# Data Quality

### Training Machine Learning Models Using Noisy Data - Butterfly Network

Dr. Zaius: I think you're crazy. The concept of a second opinion in medicine is so common that most people take it for granted, especially given a severe diagnosis. Disagreement between two doctors may be due to different levels of expertise, different levels of access to patient information or simply human error. Like all humans, even the world's best doctors make mistakes. At Butterfly, we're building machine learning tools that will act as a second pair of eyes for a doctor and even automate part of their workflow that is laborious or error prone.

### Data Cleansing for Models Trained with SGD

Data cleansing is a typical approach used to improve the accuracy of machine learning models, which, however, requires extensive domain knowledge to identify the influential instances that affect the models. In this paper, we propose an algorithm that can suggest influential instances without using any domain knowledge. With the proposed method, users only need to inspect the instances suggested by the algorithm, implying that users do not need extensive knowledge for this procedure, which enables even non-experts to conduct data cleansing and improve the model. The existing methods require the loss function to be convex and an optimal model to be obtained, which is not always the case in modern machine learning. To overcome these limitations, we propose a novel approach specifically designed for the models trained with stochastic gradient descent (SGD). The proposed method infers the influential instances by retracing the steps of the SGD while incorporating intermediate models computed in each step. Through experiments, we demonstrate that the proposed method can accurately infer the influential instances. Moreover, we used MNIST and CIFAR10 to show that the models can be effectively improved by removing the influential instances suggested by the proposed method.
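To make the notion of an "influential instance" concrete, the following minimal sketch computes influence by brute force on a toy problem: it reruns SGD with the same step order but with one instance skipped, and records how the validation loss changes. This is the counterfactual quantity that such a method aims to estimate, not the paper's algorithm, which is designed to avoid retraining. The linear model, the toy data, the corrupted index, and all function names are assumptions made for illustration.

```python
import random

def run_sgd(data, order, lr=0.1, skip=None):
    """Fit y ~ w*x + b with single-instance SGD, optionally skipping one index."""
    w, b = 0.0, 0.0
    for i in order:
        if i == skip:
            continue  # counterfactual run: this instance is never used
        x, y = data[i]
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b

def mean_squared_error(params, points):
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)

# Hypothetical toy data: y = 2x + 1, with one deliberately corrupted label.
xs = [-1.0 + 2.0 * i / 19 for i in range(20)]
data = [(x, 2.0 * x + 1.0) for x in xs]
CORRUPTED = 7
data[CORRUPTED] = (xs[CORRUPTED], data[CORRUPTED][1] + 5.0)
validation = [(x, 2.0 * x + 1.0) for x in (-0.75, -0.25, 0.25, 0.75)]

# Fix the SGD step order once so every counterfactual run retraces the same steps.
rng = random.Random(0)
order = []
for _ in range(50):
    epoch = list(range(len(data)))
    rng.shuffle(epoch)
    order.extend(epoch)

base_loss = mean_squared_error(run_sgd(data, order), validation)

# Influence of instance j: change in validation loss when j is left out.
# A strongly negative value means removing j improves the model.
influence = [
    mean_squared_error(run_sgd(data, order, skip=j), validation) - base_loss
    for j in range(len(data))
]
suspect = min(range(len(data)), key=influence.__getitem__)
print("most harmful instance:", suspect)
```

The brute-force loop needs one full retraining per instance, which is exactly the cost that makes an estimator based on the stored intermediate models attractive for larger datasets.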

### Global Big Data Conference

For years, the sheer messiness of data slowed efforts to launch artificial intelligence (A.I.) and machine learning projects. Companies weren't willing to wait a year or two while data analysts cleaned up a massive dataset, and executives sometimes had a hard time trusting the outputs of a platform or tool built on messy data. Data pre-processing is a well-established art, and there are many tech pros out there who specialize in tweaking datasets for maximum validity, accuracy, and completeness. It's a tough job, and someone has to do it (usually with the assistance of tools, as well as specialized libraries such as Pandas). But now IBM is trying to apply A.I. to this issue, via new data prep tools within AutoAI, itself a tool within the cloud-based Watson Studio.

### Machine learning for data cleaning and unification

The biggest problem data scientists face today is dirty data. When it comes to real world data, inaccurate and incomplete data are the norm rather than the exception. The root of the problem is at the source, where data being recorded does not follow standard schemas or breaks integrity constraints. The result is that dirty data gets delivered downstream to systems like data marts where it is very difficult to clean and unify, thus making it unreliable to utilize for analytics. Today data scientists often end up spending 60% of their time cleaning and unifying dirty data before they can apply any analytics or machine learning.

### Poor data quality causing majority of artificial intelligence projects to stall

A majority of enterprises engaged in artificial intelligence and machine learning initiatives (78 percent) said these projects have stalled--and data quality is one of the culprits--according to a new study from Dimensional Research. Nearly eight out of 10 organizations using AI and ML report that projects have stalled, and 96 percent of these companies have run into problems with data quality, data labeling required to train AI, and building model confidence, said the report, which was commissioned by training platform provider Alegion. For the research, Dimensional conducted a worldwide survey of 227 enterprise data scientists, other AI technologists, and business stakeholders involved in active AI and ML projects. Data issues are causing enterprises to quickly burn through AI project budgets and face project hurdles, the study said. Other findings of the survey: 70 percent of the respondents report that their first AI/ML investment was within the last 24 months; more than half of enterprises said they have undertaken fewer than four AI and ML projects; and only half of enterprises have released AI/ML projects into production.

### Learning to combine Grammatical Error Corrections

The field of Grammatical Error Correction (GEC) has produced various systems to deal with focused phenomena or general text editing. We propose an automatic way to combine black-box systems. Our method automatically detects the strength of a system or the combination of several systems per error type, improving precision and recall while optimizing $F$ score directly. We show consistent improvement over the best standalone system in all the configurations tested. This approach also outperforms average ensembling of different RNN models with random initializations. In addition, we analyze the use of BERT for GEC, reporting promising results on this end. We also present a spellchecker created for this task which outperforms standard spellcheckers tested on the task of spellchecking. This paper describes a system submission to Building Educational Applications 2019 Shared Task: Grammatical Error Correction. Combining the output of top BEA 2019 shared task systems using our approach currently holds the highest reported score in the open phase of the BEA 2019 shared task, improving $F_{0.5}$ by 3.7 points over the best result reported.

### Computing Exact Guarantees for Differential Privacy

Quantification of the privacy loss associated with a randomised algorithm has become an active area of research and $(\varepsilon,\delta)$-differential privacy has arisen as the standard measure of it. We propose a numerical method for evaluating the parameters of differential privacy for algorithms with continuous one-dimensional output. In this way the parameters $\varepsilon$ and $\delta$ can be evaluated, for example, for the subsampled multidimensional Gaussian mechanism which is also the underlying mechanism of differentially private stochastic gradient descent. The proposed method is based on a numerical approximation of an integral formula which gives the exact $(\varepsilon,\delta)$-values. The approximation is carried out by discretising the integral and by evaluating discrete convolutions using a fast Fourier transform algorithm. We give theoretical error bounds which show the convergence of the approximation and guarantee its accuracy to an arbitrary degree. Experimental comparisons with state-of-the-art techniques illustrate the efficacy of the method. Python code for the proposed method can be found on GitHub (https://github.com/DPBayes/PLD-Accountant/).
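As a rough illustration of evaluating $\delta$ for a given $\varepsilon$ by discretising an integral, the sketch below treats the simplest case only: a single application of the plain Gaussian mechanism, with no subsampling and no composition, so no FFT convolution is needed. It approximates $\delta(\varepsilon) = \int \max\{0,\, p(t) - e^{\varepsilon} q(t)\}\,dt$ with a midpoint rule, where $p$ and $q$ are the output densities on two neighbouring inputs, and compares the result with the known closed form for this mechanism. It is not the PLD-Accountant code or its API; the function names, grid width, and step size are assumptions chosen for illustration.

```python
import math

def normal_pdf(t, mean, sigma):
    z = (t - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def delta_numeric(eps, sigma, sensitivity=1.0, half_width=12.0, step=1e-3):
    """Midpoint-rule approximation of the integral of max(0, p - e^eps * q)."""
    lo = -half_width * sigma
    hi = sensitivity + half_width * sigma
    n = int(round((hi - lo) / step))
    total = 0.0
    for k in range(n):
        t = lo + (k + 0.5) * step
        p = normal_pdf(t, sensitivity, sigma)  # output density on one input
        q = normal_pdf(t, 0.0, sigma)          # output density on its neighbour
        total += max(0.0, p - math.exp(eps) * q)
    return total * step

def delta_closed_form(eps, sigma, sensitivity=1.0):
    """Closed form for the plain Gaussian mechanism, used here as a check."""
    a = sensitivity / (2.0 * sigma)
    b = eps * sigma / sensitivity
    return normal_cdf(a - b) - math.exp(eps) * normal_cdf(-a - b)

d_num = delta_numeric(eps=1.0, sigma=1.0)
d_ref = delta_closed_form(eps=1.0, sigma=1.0)
print(f"numeric delta = {d_num:.6f}, closed form = {d_ref:.6f}")
```

Once subsampling or many compositions are involved, no such closed form is available in general, which is the setting in which a discretised integral with FFT-based convolutions becomes useful.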

### Concentration bounds for linear Monge mapping estimation and optimal transport domain adaptation

This article investigates the quality of the estimator of the linear Monge mapping between distributions. We provide the first concentration result on the linear mapping operator and prove a sample complexity of $n^{-1/2}$ when using empirical estimates of first- and second-order moments. This result is then used to derive a generalization bound for domain adaptation with optimal transport. As a consequence, this method approaches the performance of the theoretical Bayes predictor under mild conditions on the covariance structure of the problem. We also discuss the computational complexity of the linear mapping estimation and show that when the source and target are stationary the mapping is a convolution that can be estimated very efficiently using fast Fourier transforms. Numerical experiments reproduce the behavior of the proven bounds on simulated and real data for mapping estimation and domain adaptation on images.
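A minimal sketch of the idea in one dimension, assuming the standard closed form for the Monge map between Gaussians: $T(x) = m_t + (\sigma_t / \sigma_s)(x - m_s)$, with the means and standard deviations replaced by their empirical estimates. In the multivariate case the scalar ratio becomes the matrix $A = \Sigma_s^{-1/2}\big(\Sigma_s^{1/2}\Sigma_t\Sigma_s^{1/2}\big)^{1/2}\Sigma_s^{-1/2}$, which this sketch does not implement. The function name and the sample data are hypothetical.

```python
import statistics

def fit_linear_monge_1d(source, target):
    """Estimate T(x) = m_t + (s_t / s_s) * (x - m_s) from two 1-D samples."""
    m_s, m_t = statistics.fmean(source), statistics.fmean(target)
    s_s, s_t = statistics.pstdev(source), statistics.pstdev(target)
    if s_s == 0.0:
        raise ValueError("source sample has zero variance")
    scale = s_t / s_s

    def transport(x):
        return m_t + scale * (x - m_s)

    return transport

# Hypothetical samples from a source and a target domain.
source = [0.0, 1.0, 2.0, 3.0, 4.0]
target = [10.0, 12.0, 14.0, 16.0, 18.0]

transport = fit_linear_monge_1d(source, target)
mapped = [transport(x) for x in source]
print(mapped)
```

After mapping, the source sample has the same empirical mean and standard deviation as the target sample, which is what allows a predictor trained on one domain to be applied to the other in this simplified setting.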

### Data cleaning in Python: some examples from cleaning Airbnb data

I previously worked for a year and a half at an Airbnb property management company, as head of the team responsible for pricing, revenue and analysis. One thing I find particularly interesting is how to figure out what price to charge for a listing on the site. Although 'it's a two bedroom in Manchester' will get you reasonably far, there are actually a huge number of factors that can influence a listing's price. As part of a bigger project on using deep learning to predict Airbnb prices, I found myself thrown back into the murky world of property data. Geospatial data can be very complex and messy -- and user-entered geospatial data doubly so.

### Practical Strategies to Handle Missing Values - DZone AI

One of the major challenges in most BI projects is to figure out a way to get clean data. This is true for both BI and Predictive Analytics projects. To improve the effectiveness of the data cleaning process, the current trend is to migrate from manual data cleaning to more intelligent machine learning-based processes. Before we dig into how to handle missing values, it's critical to figure out the nature of the missing values. There are three possible types, depending on whether a relationship exists between the missing data and the other data in the dataset.
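One rough way to probe the nature of missing values is to measure how much of a column is missing and then check whether rows with a missing value differ from the rest on another column. The sketch below does this in plain Python; the column names, the sample rows, and the helper functions are hypothetical, and the comparison is an informal diagnostic rather than a statistical test.

```python
import statistics

def missing_rate(rows, column):
    """Fraction of rows whose value in `column` is None."""
    return sum(1 for r in rows if r[column] is None) / len(rows)

def mean_by_missingness(rows, missing_column, other_column):
    """Mean of `other_column`, split by whether `missing_column` is missing."""
    when_missing = [r[other_column] for r in rows
                    if r[missing_column] is None and r[other_column] is not None]
    when_present = [r[other_column] for r in rows
                    if r[missing_column] is not None and r[other_column] is not None]
    return statistics.fmean(when_missing), statistics.fmean(when_present)

# Hypothetical records: income is missing only for the older respondents.
rows = [
    {"age": 25, "income": 30000},
    {"age": 31, "income": 42000},
    {"age": 58, "income": None},
    {"age": 62, "income": None},
    {"age": 29, "income": 38000},
    {"age": 65, "income": None},
]

rate = missing_rate(rows, "income")
age_missing, age_present = mean_by_missingness(rows, "income", "age")
print(f"income missing in {rate:.0%} of rows")
print(f"mean age when income is missing: {age_missing:.1f}, when present: {age_present:.1f}")
```

A large gap between the two means suggests that missingness is related to other data in the dataset, in which case simply dropping the incomplete rows would bias the remaining sample.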