AITopics | pclean

New system cleans messy data tables automatically

#artificialintelligence | May-16-2021, 19:23:31 GMT

MIT researchers have created a new system that automatically cleans "dirty data": the typos, duplicates, missing values, misspellings, and inconsistencies dreaded by data analysts, data engineers, and data scientists. The system, called PClean, is the latest in a series of domain-specific probabilistic programming languages written by researchers at the Probabilistic Computing Project that aim to simplify and automate the development of AI applications (others include one for 3D perception via inverse graphics and another for modeling time series and databases). According to surveys conducted by Anaconda and Figure Eight, data cleaning can take a quarter of a data scientist's time. Automating the task is challenging because different datasets require different types of cleaning, and common-sense judgment calls about objects in the world are often needed (e.g., which of several cities called "Beverly Hills" someone lives in). PClean provides generic common-sense models for these kinds of judgment calls that can be customized to specific databases and types of errors.
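The "Beverly Hills" judgment call can be illustrated with a small Bayesian calculation. The sketch below is not PClean code: the candidate states, prior weights, rent figures, and the names CANDIDATES, RENT_STD, and posterior_over_states are all hypothetical, chosen only to show how a second column (here, monthly rent) can shift belief between same-named cities.

```python
import math

# Hypothetical candidates for a record whose city is "Beverly Hills".
# Priors and typical rents are made-up numbers for illustration only.
CANDIDATES = {
    "CA": {"prior": 0.90, "mean_rent": 4000.0},
    "TX": {"prior": 0.05, "mean_rent": 1000.0},
    "MO": {"prior": 0.05, "mean_rent": 900.0},
}
RENT_STD = 500.0  # assumed spread of rents around each city's typical value


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_over_states(observed_rent):
    """Return P(state | rent), proportional to prior times rent likelihood."""
    unnormalized = {
        state: info["prior"] * gaussian_pdf(observed_rent, info["mean_rent"], RENT_STD)
        for state, info in CANDIDATES.items()
    }
    total = sum(unnormalized.values())
    return {state: weight / total for state, weight in unnormalized.items()}


if __name__ == "__main__":
    for rent in (4000.0, 1000.0):
        post = posterior_over_states(rent)
        print(rent, max(post, key=post.get), post)
```

Under these assumed numbers, a high rent leaves the prior favorite in place, while a low rent moves most of the probability to the smaller cities even though their prior weight is low.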

database, knowledge, pclean, (14 more...)

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.40)
North America > United States > California > Los Angeles County > Beverly Hills (0.27)
North America > United States > Texas (0.05)
North America > United States > Missouri (0.05)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.96)
Information Technology > Data Science > Data Quality (0.95)

PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming

Lew, Alexander K., Agrawal, Monica, Sontag, David, Mansinghka, Vikash K.

arXiv.org Artificial Intelligence | Aug-7-2020

Data cleaning is naturally framed as probabilistic inference in a generative model, combining a prior distribution over ground-truth databases with a likelihood that models the noisy channel by which the data are filtered, corrupted, and joined to yield incomplete, dirty, and denormalized datasets. Based on this view, we present PClean, a unified generative modeling architecture for cleaning and normalizing dirty data in diverse domains. Given an unclean dataset and a probabilistic program encoding relevant domain knowledge, PClean learns a structured representation of the data as a relational database of interrelated objects, and uses this latent structure to impute missing values, identify duplicates, detect errors, and propose corrections in the original data table. PClean makes three modeling and inference contributions: (i) a domain-general non-parametric generative model of relational data, for inferring latent objects and their network of latent connections; (ii) a domain-specific probabilistic programming language, for encoding domain knowledge specific to each dataset being cleaned; and (iii) a domain-general inference engine that adapts to each PClean program by constructing data-driven proposals used in sequential Monte Carlo and particle Gibbs. We show empirically that short (< 50-line) PClean programs deliver higher accuracy than state-of-the-art data cleaning systems based on machine learning and weighted logic; that PClean's inference algorithm is faster than generic particle Gibbs inference for probabilistic programs; and that PClean scales to large real-world datasets with millions of rows.
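The framing in the abstract, a prior over clean values combined with a noisy-channel likelihood, can be sketched for a single column as follows. This is a toy illustration and not the PClean language or its inference engine: the prior table, the typo_rate parameter, and the function names edit_distance and clean_value are assumptions, and the likelihood is simply taken to decay geometrically with Levenshtein distance.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def clean_value(observed, prior, typo_rate=0.1):
    """Return the most probable clean value and the full posterior.

    posterior(c) is proportional to prior(c) * typo_rate ** edit_distance(observed, c),
    an assumed noisy-channel model where each edit costs a factor of typo_rate.
    """
    scores = {
        clean: p * typo_rate ** edit_distance(observed, clean)
        for clean, p in prior.items()
    }
    total = sum(scores.values())
    posterior = {clean: s / total for clean, s in scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


if __name__ == "__main__":
    # Hypothetical prior over clean city names, e.g. from column frequencies.
    city_prior = {"Boston": 0.5, "Baltimore": 0.3, "Austin": 0.2}
    print(clean_value("Bostn", city_prior))
```

A full system along the lines the abstract describes would also infer latent objects and links across rows; this sketch only corrects one cell in isolation.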

artificial intelligence, bayesian inference, data quality, (16 more...)

arXiv:2007.11838

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Maryland > Baltimore (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre: Research Report (0.64)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Health & Medicine > Government Relations & Public Policy (0.93)
Consumer Products & Services (0.68)
Health & Medicine > Health Care Providers & Services > Reimbursement (0.68)

Technology:

Information Technology > Data Science > Data Quality > Data Cleaning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
