ptype: Probabilistic Type Inference

Ceritli, Taha, Williams, Christopher K. I., Geddes, James

Nov-22-2019–arXiv.org Machine Learning

The data type, missing data, and anomalies can be defined in broad terms as follows. The data type is the common characteristic expected to be shared by the entries in a column, such as integers, strings, IP addresses, or dates. Missing data denotes the absence of a data value, which can be encoded in various ways. Anomalies are values that match neither the given column type nor the missing type. To model these types, we have developed probabilistic finite-state machines (PFSMs) that can generate values from the corresponding domains. This, in turn, allows us to calculate the probability of a given data value being generated by a particular PFSM. We then combine these PFSMs in our model so that a data column x can be annotated via probabilistic inference, i.e., given a column of data, we can infer the column type and the rows with missing and anomalous values.
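The idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class and function names (`PFSM`, `TokenMachine`, `repeat_machine`, `infer_column`), the two candidate types, the list of missing-data tokens, the stopping probability, and the mixture weights are all assumptions chosen for illustration. Each machine assigns a probability to a string; each row is then modelled as a weighted mixture of the column-type machine, a missing-data machine, and an anomaly machine, and the column type with the highest posterior is selected.

```python
import math
import string


class PFSM:
    """Toy probabilistic finite-state machine over characters."""

    def __init__(self, start, trans, stop):
        self.start = start  # initial state
        self.trans = trans  # trans[state][char] -> (next_state, probability)
        self.stop = stop    # stop[state] -> probability of halting in that state

    def prob(self, value):
        # Forward pass: alpha maps each reachable state to its path probability.
        alpha = {self.start: 1.0}
        for ch in value:
            nxt = {}
            for state, p in alpha.items():
                if ch in self.trans.get(state, {}):
                    s2, q = self.trans[state][ch]
                    nxt[s2] = nxt.get(s2, 0.0) + p * q
            alpha = nxt
        return sum(p * self.stop.get(s, 0.0) for s, p in alpha.items())


class TokenMachine:
    """Degenerate machine: emits one of a fixed set of tokens, uniformly."""

    def __init__(self, tokens):
        self.tokens = set(tokens)

    def prob(self, value):
        return 1.0 / len(self.tokens) if value in self.tokens else 0.0


def repeat_machine(alphabet, p_stop=0.3):
    """Machine that emits one or more symbols drawn uniformly from `alphabet`."""
    first = {c: (1, 1.0 / len(alphabet)) for c in alphabet}
    more = {c: (1, (1.0 - p_stop) / len(alphabet)) for c in alphabet}
    return PFSM(0, {0: first, 1: more}, {1: p_stop})


def infer_column(column, types, missing, anomaly, w=(0.90, 0.05, 0.05)):
    """Return (best type, posterior over types, per-row labels).

    `w` holds assumed mixture weights for (column type, missing, anomaly);
    the prior over candidate types is taken to be uniform.
    """
    log_post = {}
    for name, machine in types.items():
        total = 0.0
        for v in column:
            mix = (w[0] * machine.prob(v)
                   + w[1] * missing.prob(v)
                   + w[2] * anomaly.prob(v))
            total += math.log(max(mix, 1e-300))  # floor avoids log(0)
        log_post[name] = total

    # Normalise in log space for numerical stability.
    top = max(log_post.values())
    unnorm = {t: math.exp(lp - top) for t, lp in log_post.items()}
    z = sum(unnorm.values())
    posterior = {t: u / z for t, u in unnorm.items()}
    best = max(posterior, key=posterior.get)

    # Label each row by its most probable mixture component under `best`.
    labels = []
    for v in column:
        scores = {
            "type": w[0] * types[best].prob(v),
            "missing": w[1] * missing.prob(v),
            "anomaly": w[2] * anomaly.prob(v),
        }
        labels.append(max(scores, key=scores.get))
    return best, posterior, labels


types = {
    "integer": repeat_machine(string.digits),
    "string": repeat_machine(string.ascii_letters),
}
missing = TokenMachine(["", "NA", "-", "null"])
anomaly = repeat_machine(string.printable)

column = ["12", "7", "NA", "305", "x9", "48"]
best, posterior, labels = infer_column(column, types, missing, anomaly)
print(best, labels)
```

On this toy column the sketch selects `integer` as the column type, labels `NA` as missing, and labels `x9` as anomalous. The anomaly machine accepts any printable string but spreads its probability thinly, so it only wins for rows that the type and missing machines cannot generate.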


