Autoencoder-based cleaning in probabilistic databases

Mauritz, R. R., Nijweide, F. P. J., Goseling, J., van Keulen, M.

arXiv.org Artificial Intelligence 

In the field of data integration, data quality problems are often encountered when extracting, combining, and merging data. The probabilistic data integration approach represents information about such problems as uncertainties in a probabilistic database. In this paper, we propose a data-cleaning autoencoder capable of near-automatic data quality improvement. It learns the structure and dependencies in the data to identify and correct doubtful values. A theoretical framework is provided, and experiments show that it can remove significant amounts of noise from categorical and numeric probabilistic data. Our method does not require clean data. We do, however, show that manually cleaning a small fraction of the data significantly improves performance.

I. Introduction

Data quality problems are a major threat in data science. Specifically, in the field of data integration, i.e., combining several data sources into a single and unified view [1], uncertainties are often encountered when extracting, combining, and merging data. These uncertainties can result from the nature of the data (such as noise in measurements), but can also be a result of the integration process itself. Information about these uncertainties is considered an important result of the integration process [2]. In probabilistic data integration (PDI), the integration result, as well as information on its uncertainty, is stored in a probabilistic database (PDB) [3]. The PDB maintains possible alternatives for values and records, their likelihoods, and the dependencies among them. The PDI process (see Figure 1) consists of two main phases. The first phase produces an initial probabilistic integration result; it is followed by the improvement phase, which gathers evidence while the data is being used, for the purpose of gradually improving its quality.
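To make the PDB idea concrete, the following is a minimal toy sketch (not any particular PDB system; all names and values are invented) of a record whose uncertain cells hold mutually exclusive alternatives with probabilities. Enumerating the possible worlds shows how likelihoods attach to alternatives; real PDBs additionally track dependencies among cells, which this independent-cells sketch omits.

```python
from itertools import product

# Hypothetical uncertain record: each cell lists (value, probability)
# alternatives, assumed mutually exclusive and independent across cells.
record = {
    "name": [("Smith", 0.7), ("Smyth", 0.3)],  # extraction ambiguity
    "age":  [(34, 0.9), (43, 0.1)],            # possible digit transposition
}

def possible_worlds(rec):
    """Enumerate every possible world of a record with independent
    uncertain cells, together with that world's probability."""
    keys = list(rec)
    for combo in product(*(rec[k] for k in keys)):
        p = 1.0
        world = {}
        for key, (value, pv) in zip(keys, combo):
            world[key] = value
            p *= pv
        yield world, p

worlds = list(possible_worlds(record))
# The most probable world combines the most likely alternative per cell.
best_world, best_p = max(worlds, key=lambda wp: wp[1])
# best_world == {"name": "Smith", "age": 34}, best_p ~= 0.63
```

The world probabilities sum to one, so queries over the record can be answered by aggregating over worlds — the standard possible-worlds semantics that PDBs implement far more efficiently than this explicit enumeration.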
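The cleaning idea — train an autoencoder on the (possibly noisy) data and treat cells with large reconstruction error as doubtful — can be sketched as follows. This is a generic denoising-style illustration in NumPy, not the authors' architecture: it handles only numeric attributes, and the network size, learning rate, noise injection, and threshold are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyAutoencoder:
    """One-hidden-layer autoencoder with a linear output, trained by
    full-batch gradient descent on mean squared reconstruction error."""

    def __init__(self, n_features, n_hidden, lr=0.1):
        self.W1 = rng.normal(0, 0.1, (n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_features))
        self.b2 = np.zeros(n_features)
        self.lr = lr

    def forward(self, X):
        H = sigmoid(X @ self.W1 + self.b1)
        return H, H @ self.W2 + self.b2  # linear output for numeric data

    def train_step(self, X):
        H, Xhat = self.forward(X)
        err = Xhat - X                         # gradient of 0.5 * MSE
        gW2 = H.T @ err / len(X)
        gb2 = err.mean(axis=0)
        dH = (err @ self.W2.T) * H * (1 - H)   # backprop through sigmoid
        gW1 = X.T @ dH / len(X)
        gb1 = dH.mean(axis=0)
        self.W1 -= self.lr * gW1; self.b1 -= self.lr * gb1
        self.W2 -= self.lr * gW2; self.b2 -= self.lr * gb2
        return float((err ** 2).mean())

# Toy data: four perfectly correlated numeric attributes, with additive
# noise injected into a few cells to play the role of doubtful values.
clean = rng.normal(size=(256, 1)) @ np.ones((1, 4))
noisy = clean.copy()
noisy[:16, 0] += 3.0

ae = TinyAutoencoder(n_features=4, n_hidden=2)
losses = [ae.train_step(noisy) for _ in range(500)]

# Cells whose reconstruction deviates strongly from the stored value are
# flagged as candidates for correction (threshold chosen for illustration).
_, recon = ae.forward(noisy)
suspect = np.abs(recon - noisy) > 1.0
```

Because the hidden layer is narrower than the input, the network must exploit the cross-attribute correlation to reconstruct records, which is what makes reconstructions disagree with injected noise; in the paper's setting, the reconstruction would additionally be combined with the probabilistic alternatives stored in the PDB.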
