Record fusion: A learning approach

Heidari, Alireza, Michalopoulos, George, Kushagra, Shrinu, Ilyas, Ihab F., Rekatsinas, Theodoros

Jun-17-2020, arXiv.org Machine Learning

Record fusion is the task of aggregating multiple records that correspond to the same real-world entity in a database. We can view record fusion as a machine learning problem where the goal is to predict the "correct" value of each attribute of each entity. Given a database, we use a combination of attribute-level, record-level, and database-level signals to construct a feature vector for each cell, that is, each (row, column) pair, of that database. We use this feature vector along with the ground-truth information to learn a classifier for each attribute of the database. Our learning algorithm uses a novel stagewise additive model. At each stage, we construct a new feature vector by combining a part of the original feature vector with features computed from the predictions of the previous stage. We then learn a softmax classifier over the new feature space. This greedy stagewise approach can be viewed as a deep model in which each stage adds more complicated non-linear transformations of the original feature vector. We show that our approach fuses records with an average precision of ~98% when source information of records is available, and ~94% without source information, across a diverse array of real-world datasets. We compare our approach to a comprehensive collection of data fusion and entity consolidation methods considered in the literature. We show that our approach improves average precision by ~20% when source information is available and by ~45% when it is not.
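The stagewise idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a plain linear softmax classifier trained by batch gradient descent, a uniform class distribution as the "previous prediction" at the first stage, and a hypothetical `kept` parameter that selects which part of the original feature vector is carried into each stage (the abstract does not say which part is used). All function names and the toy data are hypothetical, and each toy row stands in for the feature vector of one cell.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, x):
    """Class probabilities of a linear softmax model; the last column of W is the bias."""
    scores = [sum(w * xj for w, xj in zip(row[:-1], x)) + row[-1] for row in W]
    return softmax(scores)

def train_softmax(X, y, n_classes, epochs=500, lr=0.5):
    """Fit a linear softmax classifier with batch gradient descent on the mean log loss."""
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for x, label in zip(X, y):
            p = predict_proba(W, x)
            for k in range(n_classes):
                err = p[k] - (1.0 if k == label else 0.0)
                for j, xj in enumerate(x):
                    grad[k][j] += err * xj
                grad[k][-1] += err
        for k in range(n_classes):
            for j in range(n_features + 1):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W

def fit_stagewise(X, y, n_classes, n_stages=3, kept=None):
    """Greedy stagewise fit: each stage sees the kept original features
    plus the class probabilities predicted by the previous stage."""
    kept = list(range(len(X[0]))) if kept is None else kept
    prev = [[1.0 / n_classes] * n_classes for _ in X]  # assumed uninformative start
    stages = []
    for _ in range(n_stages):
        Z = [[x[j] for j in kept] + p for x, p in zip(X, prev)]
        W = train_softmax(Z, y, n_classes)
        prev = [predict_proba(W, z) for z in Z]
        stages.append(W)
    return stages

def predict_stagewise(stages, x, n_classes, kept=None):
    """Run one feature vector through all stages and return the most probable class."""
    kept = list(range(len(x))) if kept is None else kept
    p = [1.0 / n_classes] * n_classes
    for W in stages:
        p = predict_proba(W, [x[j] for j in kept] + p)
    return max(range(n_classes), key=lambda k: p[k])

# Toy data (made up for illustration): two features per cell, two classes.
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]
stages = fit_stagewise(X, y, n_classes=2)
print([predict_stagewise(stages, x, 2) for x in X])
```

Because every stage after the first takes the previous stage's softmax outputs as inputs, the composed model is a non-linear function of the original features even though each stage is linear in its own inputs. That is one way to read the abstract's "deep model" remark.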

artificial intelligence, machine learning



Country:
- North America
  - United States
    - New York (0.04)
    - Wisconsin > Dane County
      - Madison (0.04)
    - Washington > King County
      - Seattle (0.14)
      - Redmond (0.04)
    - California > Santa Clara County
      - Mountain View (0.14)
  - Canada > Ontario
    - Toronto (0.04)
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
