record linkage
A Robust and Efficient Pipeline for Enterprise-Level Large-Scale Entity Resolution
Kannangara, Sandeepa, Abrahamyan, Arman, Elias, Daniel, Kilby, Thomas, Dar, Nadav, Pizzato, Luiz, Leontjeva, Anna, Jermyn, Dan
Entity resolution (ER) remains a significant challenge in data management, especially when dealing with large datasets. This paper introduces MERAI (Massive Entity Resolution using AI), a robust and efficient pipeline designed to address record deduplication and linkage issues in high-volume datasets at an enterprise level. The pipeline's resilience and accuracy have been validated through various large-scale record deduplication and linkage projects. To evaluate MERAI's performance, we compared it with two well-known entity resolution libraries, Dedupe and Splink. While Dedupe failed to scale beyond 2 million records due to memory constraints, MERAI successfully processed datasets of up to 15.7 million records and produced accurate results across all experiments. Experimental data demonstrates that MERAI outperforms both baseline systems in terms of matching accuracy, with consistently higher F1 scores in both deduplication and record linkage tasks. MERAI offers a scalable and reliable solution for enterprise-level large-scale entity resolution, ensuring data integrity and consistency in real-world applications.
- North America > United States > North Carolina (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
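Pipelines like the one this abstract describes typically chain three stages: blocking, pairwise matching, and transitive clustering. The sketch below illustrates that general shape only; it is not MERAI's implementation, and the records, fields, and 0.5 Jaccard threshold are invented.

```python
# Minimal three-stage deduplication sketch: block, match, cluster.
from itertools import combinations

records = [
    {"id": 1, "name": "Acme Pty Ltd",  "city": "Sydney"},
    {"id": 2, "name": "ACME Pty. Ltd", "city": "Sydney"},
    {"id": 3, "name": "Beta Corp",     "city": "Berlin"},
]

def tokens(r):
    return set((r["name"] + " " + r["city"]).lower().replace(".", "").split())

# 1) Blocking: only compare records that share a cheap key (here, city).
blocks = {}
for r in records:
    blocks.setdefault(r["city"].lower(), []).append(r)

# 2) Pairwise matching within blocks via Jaccard similarity of tokens.
def jaccard(a, b):
    return len(a & b) / len(a | b)

matches = []
for block in blocks.values():
    for r1, r2 in combinations(block, 2):
        if jaccard(tokens(r1), tokens(r2)) >= 0.5:
            matches.append((r1["id"], r2["id"]))

# 3) Clustering: union-find makes pairwise matches transitive.
parent = {r["id"]: r["id"] for r in records}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
for a, b in matches:
    parent[find(a)] = find(b)

clusters = {}
for r in records:
    clusters.setdefault(find(r["id"]), []).append(r["id"])
print(sorted(sorted(c) for c in clusters.values()))  # [[1, 2], [3]]
```

The blocking step is what makes such pipelines scale: it bounds the quadratic pairwise comparison to within-block pairs.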
Leveraging Language Models for Automated Patient Record Linkage
Beheshti, Mohammad, Gondara, Lovedeep, Zachary, Iris
Objective: Healthcare data fragmentation presents a major challenge for linking patient data, necessitating robust record linkage to integrate patient records from diverse sources. This study investigates the feasibility of leveraging language models for automated patient record linkage, focusing on two key tasks: blocking and matching. Materials and Methods: We utilized real-world healthcare data from the Missouri Cancer Registry and Research Center, linking patient records from two independent sources using probabilistic linkage as a baseline. A transformer-based model, RoBERTa, was fine-tuned for blocking using sentence embeddings. For matching, several language models were evaluated under fine-tuned and zero-shot settings, and their performance was assessed against ground-truth labels. Results: The fine-tuned blocking model achieved a 92% reduction in the number of candidate pairs while maintaining near-perfect recall. In the matching task, fine-tuned Mistral-7B achieved the best performance with only 6 incorrect predictions. Among zero-shot models, Mistral-Small-24B performed best, with a total of 55 incorrect predictions. Discussion: Fine-tuned language models achieved strong performance in patient record blocking and matching with minimal errors. However, they remain less accurate and efficient than a hybrid rule-based and probabilistic approach for blocking. Additionally, reasoning models like DeepSeek-R1 are impractical for large-scale record linkage due to high computational costs. Conclusion: This study highlights the potential of language models for automating patient record linkage, offering improved efficiency by eliminating the manual effort that record linkage otherwise requires. Overall, language models offer a scalable solution that can enhance data integration, reduce manual effort, and support disease surveillance and research.
- North America > United States > Missouri > Boone County > Columbia (0.14)
- North America > Canada > British Columbia (0.04)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Health Care Technology > Medical Record (1.00)
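The embedding-based blocking idea in this abstract can be sketched with toy vectors: below, character-bigram count vectors stand in for the RoBERTa sentence embeddings the study actually fine-tunes, and the patient strings and 0.7 cosine threshold are invented.

```python
# Toy embedding-based blocking: keep only record pairs whose embedding
# cosine similarity clears a threshold.
import math
from collections import Counter
from itertools import combinations

def embed(text):
    # Character-bigram counts as a crude stand-in for a sentence embedding.
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

patients = ["Jonathan Smith 1984", "Jon Smith 1984", "Maria Garcia 1990"]

# Blocking: the surviving candidate pairs go on to the matching stage.
candidates = [
    (a, b)
    for a, b in combinations(patients, 2)
    if cosine(embed(a), embed(b)) >= 0.7
]
print(candidates)
```

Only the near-duplicate pair survives, which is the point of blocking: the expensive matching model never sees the clearly dissimilar pairs.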
Multi-Layer Privacy-Preserving Record Linkage with Clerical Review based on gradual information disclosure
Rohde, Florens, Christen, Victor, Franke, Martin, Rahm, Erhard
Record linkage, also known as entity resolution, aims at identifying different representations of the same real-world entity, such as a person. It is a crucial step in many data integration tasks, combining multiple data sources to allow enhanced data analysis. Typically, unique record identifiers that would enable a join-like operation are not available. Therefore, records are compared pairwise based on their identifying attributes, such as first name, last name and date of birth, and classified as match or non-match. However, record linkage may potentially harm the privacy of individuals by combining information that can be used against their interests. As a consequence, conducting such a linkage is subject to many legal and organizational constraints [CRS20]. Privacy-preserving record linkage (PPRL) methods aim to enable such linkages without sharing sensitive plaintext information between the data owners or with a third party. To protect the identifying data, the data owners encode it before sending it to an independent linkage unit, which performs the matching on the encoded data only. A variety of such perturbation-based encoding techniques have been proposed; the most popular, a quasi-standard, is based on Bloom filters [Gk21].
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Germany > Saxony > Leipzig (0.05)
- Oceania > Australia (0.04)
- (3 more...)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.68)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.42)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.36)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.34)
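The Bloom-filter encoding mentioned at the end of this abstract can be sketched briefly: each party hashes the character bigrams of an identifying attribute into a fixed-length bit array, and the linkage unit compares only bit arrays, typically with the Dice coefficient. The filter length, hash count, and names below are illustrative, not the parameters of any specific PPRL deployment.

```python
# Sketch of Bloom-filter encoding for PPRL.
import hashlib

M, K = 64, 2  # bits per filter, hash functions per bigram (toy values)

def bloom_encode(value):
    bits = [0] * M
    v = value.lower()
    for gram in (v[i:i + 2] for i in range(len(v) - 1)):
        for k in range(K):
            # Derive K positions per bigram from keyed SHA-256 digests.
            h = hashlib.sha256(f"{k}:{gram}".encode()).digest()
            bits[int.from_bytes(h[:4], "big") % M] = 1
    return bits

def dice(a, b):
    # Dice coefficient on bit arrays: 2|A ∩ B| / (|A| + |B|).
    common = sum(x & y for x, y in zip(a, b))
    return 2 * common / (sum(a) + sum(b))

# Each data owner encodes locally; only the bit arrays are shared.
enc_a = bloom_encode("smith")
enc_b = bloom_encode("smyth")
enc_c = bloom_encode("garcia")
print(dice(enc_a, enc_b), dice(enc_a, enc_c))
```

Similar names share bigrams and hence bit positions, so their encodings score high on Dice without any plaintext leaving the data owners.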
Leveraging Large Language Models for Generating Labeled Mineral Site Record Linkage Data
Record linkage integrates diverse data sources by identifying records that refer to the same entity. In the context of mineral site records, accurate record linkage is crucial for identifying and mapping mineral deposits. Properly linking records that refer to the same mineral deposit helps define the spatial coverage of mineral areas, benefiting resource identification and site data archiving. Mineral site record linkage falls under the spatial record linkage category since the records contain information about the physical locations and non-spatial attributes in a tabular format. The task is particularly challenging due to the heterogeneity and vast scale of the data. While prior research has applied pre-trained discriminative language models (PLMs) to spatial entity linkage, these approaches often require substantial amounts of curated ground-truth data for fine-tuning. Gathering and creating ground-truth data is both time-consuming and costly. Therefore, such approaches are not always feasible in real-world scenarios where gold-standard data are unavailable. Although large generative language models (LLMs) have shown promising results in various natural language processing tasks, including record linkage, their high inference time and resource demands present challenges. We propose a method that leverages an LLM to generate training data and fine-tune a PLM, addressing the training-data gap while preserving the efficiency of PLMs. Our approach achieves over 45% improvement in F1 score for record linkage compared to traditional PLM-based methods using ground-truth data, while reducing the inference time by nearly 18 times compared to relying on LLMs. Additionally, we offer an automated pipeline that eliminates the need for human intervention, highlighting this approach's potential to overcome record linkage challenges.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
- Europe > Austria > Vienna (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- (15 more...)
- Materials > Metals & Mining (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
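The label-then-distil pattern this abstract describes can be sketched in miniature: an expensive "teacher" labels record pairs, and a cheap "student" is fitted to those noisy labels. Below, a hand-written rule stands in for the LLM teacher and a one-feature threshold stands in for the fine-tuned PLM student; all records, features, and rules are invented for illustration.

```python
# Sketch of distilling noisy teacher labels into a cheap student scorer.
from difflib import SequenceMatcher

pairs = [
    (("Iron Mtn Mine", 40.1, -122.5), ("Iron Mountain Mine", 40.1, -122.5)),
    (("Iron Mtn Mine", 40.1, -122.5), ("Copper King Deposit", 41.0, -105.4)),
    (("Gold Hill", 39.3, -111.9), ("Gold Hill Mine", 39.3, -111.9)),
    (("Gold Hill", 39.3, -111.9), ("Silver Reef", 37.2, -113.4)),
]

def teacher_label(a, b):
    # Stand-in for an LLM judgment: same site if names overlap and
    # coordinates agree closely.
    close = abs(a[1] - b[1]) < 0.05 and abs(a[2] - b[2]) < 0.05
    overlap = set(a[0].lower().split()) & set(b[0].lower().split())
    return int(close and bool(overlap))

def feature(a, b):
    # Single name-similarity feature for the student.
    return SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio()

labeled = [(feature(a, b), teacher_label(a, b)) for a, b in pairs]

# "Train" the student: put a decision threshold midway between the mean
# feature value of each teacher-assigned class.
pos = [f for f, y in labeled if y == 1]
neg = [f for f, y in labeled if y == 0]
threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

student = lambda a, b: int(feature(a, b) >= threshold)
print([student(a, b) for a, b in pairs])
```

Once trained, only the cheap student runs at inference time, which is where the efficiency gain over querying the LLM per pair comes from.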
Towards Split Learning-based Privacy-Preserving Record Linkage
Zervas, Michail, Karakasidis, Alexandros
Split Learning has recently been introduced to facilitate applications where user data privacy is a requirement. However, it has not been thoroughly studied in the context of Privacy-Preserving Record Linkage, a problem in which the same real-world entity should be identified among databases from different dataholders without disclosing any additional information. In this paper, we investigate the potential of Split Learning for privacy-preserving record matching by introducing a novel training method built on Reference Sets, which are publicly available data corpora. Our approach shows minimal impact on matching quality compared with a traditional centralized SVM-based technique.
- North America > United States > District of Columbia > Washington (0.05)
- Europe > North Macedonia (0.04)
- Europe > Greece > Central Macedonia > Thessaloniki (0.04)
- (2 more...)
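The reference-set idea in this abstract can be sketched simply: each party maps a private name into a vector of similarities against a shared public reference corpus, so only reference-space vectors (not plaintext) need to leave the party. The reference names and the use of an edit-style similarity below are illustrative; the paper itself trains split neural models on such representations.

```python
# Sketch of reference-set encoding for privacy-preserving matching.
from difflib import SequenceMatcher

REFERENCE_SET = ["anderson", "martinez", "thompson", "walker"]  # public corpus

def to_reference_vector(name):
    # Represent a private name by its similarity to each public reference
    # name; the plaintext never needs to be shared.
    n = name.lower()
    return [SequenceMatcher(None, n, ref).ratio() for ref in REFERENCE_SET]

# Two dataholders encode their private records independently.
v1 = to_reference_vector("Andersen")   # party A's record
v2 = to_reference_vector("Anderson")   # party B's record
v3 = to_reference_vector("Walker")     # party B's record

def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

# Near-duplicate names land close together in reference space.
print(l1_distance(v1, v2), l1_distance(v1, v3))
```

A downstream classifier (the paper uses split models; the baseline is a centralized SVM) then decides match/non-match from these vectors alone.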
Flexible Models for Microclustering with Application to Entity Resolution
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.
- Asia > Middle East > Syria (0.14)
- North America > United States (0.14)
- Europe > Italy (0.05)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Government (0.68)
- Health & Medicine (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.84)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
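The sublinear-growth requirement in this abstract has a compact standard formulation, stated below with $M_n$ denoting the size of the largest cluster in a random partition of $n$ data points:

```latex
% Microclustering property: the largest cluster occupies a vanishing
% fraction of the data set as it grows.
\[
  \frac{M_n}{n} \;\xrightarrow{\,p\,}\; 0
  \qquad \text{as } n \to \infty .
\]
```

By contrast, exchangeable models such as Dirichlet process mixtures give clusters whose expected sizes grow linearly in $n$, which is why the abstract argues they are unsuitable for entity resolution.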
Distributed Record Linkage in Healthcare Data with Apache Spark
Heydari, Mohammad, Sarshar, Reza, Soltanshahi, Mohammad Ali
Healthcare data is a valuable resource for research, analysis, and decision-making in the medical field. However, healthcare data is often fragmented and distributed across various sources, making it challenging to combine and analyze effectively. Record linkage, also known as data matching, is a crucial step in integrating and cleaning healthcare data to ensure data quality and accuracy. Apache Spark, a powerful open-source distributed big data processing framework, provides a robust platform for performing record linkage tasks with the aid of its machine learning library. In this study, we developed a new distributed data-matching model based on the Apache Spark machine learning library. To verify that the model functions correctly, a validation phase was performed on the training data. The main challenge is data imbalance: a large share of the record pairs is labeled false and only a small number are labeled true. Using SVM and regression algorithms, we found that the model neither over-fits nor under-fits the research data, which shows that our distributed model works well on the data.
- Asia > Middle East > Iran > Tehran Province > Tehran (0.05)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Consumer Health (1.00)
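A common remedy for the match/non-match imbalance this abstract highlights is to weight each training pair inversely to its class frequency, the same idea as supplying a weight column (`weightCol`) to Spark ML classifiers such as LogisticRegression or LinearSVC. The sketch below computes such balancing weights in plain Python; the 95/5 label split is invented for illustration.

```python
# Sketch of class-balancing weights for imbalanced match labels.
labels = [0] * 95 + [1] * 5          # heavily imbalanced pair labels

n = len(labels)
counts = {c: labels.count(c) for c in set(labels)}

# Weight for class c: n / (num_classes * count_c), so each class
# contributes equal total weight to the loss.
weights = {c: n / (len(counts) * cnt) for c, cnt in counts.items()}
sample_weights = [weights[y] for y in labels]

print(weights)  # rare "true match" pairs receive a much larger weight
```

With these weights, both classes contribute the same total mass to training, so the classifier is not dominated by the abundant non-match pairs.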