data quality


Global Big Data Conference

#artificialintelligence

Companies face issues with training data quality and labeling when launching AI and machine learning initiatives, according to a Dimensional Research report. The worldwide spending on artificial intelligence (AI) systems is predicted to hit $35.8 billion in 2019, according to IDC. This increased spending is no surprise: With digital transformation initiatives critical for business survival, companies are making large investments in advanced technologies. However, nearly eight out of 10 organizations engaged in AI and machine learning said that projects have stalled, according to a Dimensional Research report. The majority (96%) of these organizations said they have run into problems with data quality, data labeling necessary to train AI, and building model confidence.


Data cleaning in Python: some examples from cleaning Airbnb data

#artificialintelligence

I previously worked for a year and a half at an Airbnb property management company, as head of the team responsible for pricing, revenue and analysis. One thing I find particularly interesting is how to figure out what price to charge for a listing on the site. Although 'it's a two bedroom in Manchester' will get you reasonably far, there are actually a huge number of factors that can influence a listing's price. As part of a bigger project on using deep learning to predict Airbnb prices, I found myself thrown back into the murky world of property data. Geospatial data can be very complex and messy -- and user-entered geospatial data doubly so.


Practical Strategies to Handle Missing Values - DZone AI

#artificialintelligence

One of the major challenges in most BI projects is to figure out a way to get clean data. This is true for both BI and Predictive Analytics projects. To improve the effectiveness of the data cleaning process, the current trend is to migrate from manual data cleaning to more intelligent machine learning-based processes. Before we dig into how to handle missing values, it's critical to figure out the nature of the missing values. There are three possible types, depending on whether a relationship exists between the missing data and the other data in the dataset.
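One informal way to probe the nature of missing values is to check whether the missingness rate of one field changes with the value of another, observed field. The sketch below is only an illustration under stated assumptions: the records, the field names (`segment`, `income`) and the helper `missing_rate_by_group` are hypothetical and do not come from the article. A large gap between groups hints that missingness is related to other data in the dataset. It is not a formal statistical test.

```python
from collections import defaultdict


def missing_rate_by_group(rows, target, group):
    """Return {group value: fraction of rows where `target` is None}."""
    counts = defaultdict(lambda: [0, 0])  # group value -> [missing, total]
    for row in rows:
        bucket = counts[row[group]]
        bucket[0] += row[target] is None  # True counts as 1
        bucket[1] += 1
    return {g: missing / total for g, (missing, total) in counts.items()}


# Hypothetical toy records: income is missing more often in segment B.
rows = [
    {"segment": "A", "income": 52000},
    {"segment": "A", "income": 48000},
    {"segment": "A", "income": None},
    {"segment": "A", "income": 61000},
    {"segment": "B", "income": None},
    {"segment": "B", "income": None},
    {"segment": "B", "income": 39000},
    {"segment": "B", "income": None},
]

rates = missing_rate_by_group(rows, target="income", group="segment")
print(rates)
```

If the rates were roughly equal across groups, the data would give no sign of such a relationship, although a check like this can never rule out dependence on the unobserved values themselves.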


What To Know About The Impact Of Data Quality And Quantity In AI

#artificialintelligence

Believe it or not, there is such a thing as "good data" and "bad data" -- especially when it comes to AI. To be more specific, just having data available isn't enough: There's a distinction worth making between "useful" and "not-so-useful" data. Sometimes data must be discarded on sight because of how or where it got collected, signs of inaccuracy or forgery and other red flags. Other times, data can get processed first, then passed on for use in artificial intelligence development. A closer look at this process reveals a symbiotic relationship between our ability to gather data and process it -- and our ability to build ever-smarter artificial intelligence.


7 ways analytical methods improve data quality

#artificialintelligence

Data scientists spend a lot of their time using data. Data quality is essential for applying machine learning models to solve business questions and training AI models. However, analytics and data science do not just make demands on data quality. They can also contribute a lot to improving the quality of your data. Missing value imputation and detection of complex outliers are perhaps the two best-known capabilities of analytics in data quality, but they are by no means the only ones.


Artificial Intelligence Enabled audience profiling in MoMAGIC

#artificialintelligence

"Sell the right product to the right customer" is the dream of every marketer and advertiser. As a leading digital marketing company in Asia, MoMAGIC is approaching this ultimate goal by launching two data-driven solutions, TrueReach and TrueInsight, which are capable of understanding and targeting audiences from both macroscopic and microscopic points of view. In this article, we will share the core ideas behind our solutions and some insights we learned when playing with large-scale real data. As the backbone of all MoMAGIC services, TrueInsight integrates large-scale, heterogeneous data sources (e.g., AD-request, web behaviors, etc.) and transforms those high-frequency, noisy dataflows into structured datasets. In order to complete those challenging tasks quickly and accurately, we first carefully designed the data processing pipelines of TrueInsight in a parallel and distributed manner to guarantee its performance scalability, which means TrueInsight can process large-scale data from multiple sources without sacrificing its performance.


Training a Neural Speech Waveform Model using Spectral Losses of Short-Time Fourier Transform and Continuous Wavelet Transform

arXiv.org Machine Learning

Recently, we proposed short-time Fourier transform (STFT)-based loss functions for training a neural speech waveform model. In this paper, we generalize the above framework and propose a training scheme for such models based on spectral amplitude and phase losses obtained by either STFT or continuous wavelet transform (CWT), or both of them. Since CWT is capable of having time and frequency resolutions different from those of STFT and is capable of considering those closer to human auditory scales, the proposed loss functions could provide complementary information on speech signals. Experimental results showed that it is possible to train a high-quality model by using the proposed CWT spectral loss, and that this model is as good as one trained using STFT-based loss.
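To make the idea of a spectral amplitude loss concrete, here is a minimal sketch that compares the STFT amplitudes of a reference and a generated waveform. It is an illustration only, not the loss defined in the paper: the squared-error form, the frame length, the hop size and the function names `stft_amplitudes` and `amplitude_loss` are all assumptions, the phase and CWT terms are omitted, and a naive DFT stands in for an optimized FFT.

```python
import cmath
import math


def stft_amplitudes(signal, frame_len=8, hop=4):
    """Amplitude spectrogram from a Hann-windowed naive DFT (one list per frame)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            bins.append(abs(coeff))
        frames.append(bins)
    return frames


def amplitude_loss(reference, generated, **stft_kwargs):
    """Mean squared difference between the two amplitude spectrograms."""
    ref = stft_amplitudes(reference, **stft_kwargs)
    gen = stft_amplitudes(generated, **stft_kwargs)
    diffs = [(a - b) ** 2 for fr, fg in zip(ref, gen) for a, b in zip(fr, fg)]
    return sum(diffs) / len(diffs)


# Hypothetical signals: a sine wave and a half-amplitude copy of it.
reference = [math.sin(2 * math.pi * 2 * n / 16) for n in range(32)]
generated = [0.5 * x for x in reference]

print(amplitude_loss(reference, reference))  # identical signals
print(amplitude_loss(reference, generated))  # attenuated copy
```

In a real training setup the same quantity would be computed with a differentiable STFT so that gradients can flow back to the waveform model.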


Data Science and Digital Systems: The 3Ds of Machine Learning Systems Design

arXiv.org Artificial Intelligence

There is a lot of talk about the fourth industrial revolution centered around AI. If we are at the start of the fourth industrial revolution, we also have the unusual honour of being the first to name our revolution before it's occurred. The technology that has driven the revolution in AI is machine learning. And when it comes to capitalising on the new generation of deployed machine learning solutions there are practical difficulties we must address. In 1987 the economist Robert Solow quipped "You can see the computer age everywhere but in the productivity statistics".


AI Needs Better Data, Not Just More Data

#artificialintelligence

AI has a data quality problem. In a survey of 179 data scientists, over half identified addressing issues related to data quality as the biggest bottleneck in successful AI projects. Big data is so often improperly formatted, lacking metadata, or "dirty," meaning incomplete, incorrect, or inconsistent, that data scientists typically spend 80 percent of their time on cleaning and preparing data to make it usable, leaving them with just 20 percent of their time to focus on actually using data for analysis. This means organizations developing and using AI must devote huge amounts of resources to ensuring they have sufficient amounts of high-quality data so that their AI tools are not useless. As policymakers pursue national strategies to increase their competitiveness in AI, they should recognize that any country that wants to lead in AI must also lead in data quality.


Uncertainty quantification of molecular property prediction with Bayesian neural networks

arXiv.org Machine Learning

Deep neural networks have outperformed existing machine learning models in various molecular applications. In practical applications, it is still difficult to make confident decisions because of the uncertainty in predictions arising from insufficient quality and quantity of training data. Here, we show that Bayesian neural networks are useful to quantify the uncertainty of molecular property prediction with three numerical experiments. In particular, this approach enables us to decompose the predictive variance into the model- and data-driven uncertainties, which helps to elucidate the source of errors. In the logP predictions, we show that data noise affected the data-driven uncertainties more significantly than the model-driven ones. Based on this analysis, we were able to find unexpected errors in the Harvard Clean Energy Project dataset. Lastly, we show that the confidence of prediction is closely related to the predictive uncertainty by performing experiments on bio-activity and toxicity classification problems.
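The decomposition of predictive variance mentioned above can be sketched for a regression setting using the law of total variance. The sketch assumes that each sampled network returns a predicted mean and a predicted noise variance for the same input; the paper's exact formulation may differ, and the function name and numbers below are hypothetical. The spread of the means across samples is taken as the model-driven part, and the average predicted variance as the data-driven part.

```python
from statistics import mean, pvariance


def decompose_uncertainty(sampled_means, sampled_variances):
    """Split total predictive variance into model- and data-driven parts.

    sampled_means[t] and sampled_variances[t] are the mean and noise
    variance predicted by the t-th sampled network for one input.
    """
    model_driven = pvariance(sampled_means)  # disagreement between samples
    data_driven = mean(sampled_variances)    # average predicted noise
    return model_driven, data_driven, model_driven + data_driven


# Hypothetical outputs from three sampled networks for a single molecule.
means = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]

model_part, data_part, total = decompose_uncertainty(means, variances)
print(model_part, data_part, total)
```

Under these assumptions, noisy labels would mainly raise the data-driven term, while sparse coverage of the input space would mainly raise the model-driven term.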