AITopics

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.51)
Information Technology > Artificial Intelligence > Natural Language (0.36)

Neural Information Processing Systems, Nov-21-2025, 14:47:29 GMT

Flexible Models for Microclustering with Application to Entity Resolution

Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.
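The sublinear-growth requirement in the abstract above can be written compactly. The statement below is a sketch of how the microclustering property is usually formalized; the symbols C_N and M_N are notation assumed here for illustration and do not appear in the listing.

```latex
% Assumed notation: C_N is a random partition of the N data points {1, ..., N},
% and M_N is the size of its largest cluster.
M_N = \max_{c \in C_N} |c|,
\qquad
\frac{M_N}{N} \xrightarrow{\;p\;} 0 \quad \text{as } N \to \infty .
```

Read this way, the largest cluster holds a vanishing fraction of the data as N grows, whereas under the linear-growth assumption the ratio M_N / N does not vanish.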

application, flexible model, microclustering, (7 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.88)

Arya Mazumdar, Barna Saha

Clustering with Noisy Queries

Neural Information Processing Systems, Nov-21-2025, 13:11:57 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, social media, (18 more...)

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
North America > United States > Michigan (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Data Science (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Neural Information Processing Systems, Nov-21-2025, 06:47:43 GMT

Flexible Models for Microclustering with Application to Entity Resolution

Brenda Betancourt, Giacomo Zanella, Jeffrey W. Miller, Hanna Wallach, Abbas Zaidi, Beka Steorts

However, for some applications, this assumption is inappropriate.

information retrieval, machine learning, natural language, (20 more...)

Country:

Asia > Middle East > Syria (0.14)
North America > United States (0.14)
Europe > Italy (0.05)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Industry:

Government (0.68)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.44)

Arya Mazumdar, Soumyabrata Pal

Semisupervised Clustering, AND-Queries and Locally Encodable Source Coding

Neural Information Processing Systems, Nov-21-2025, 05:36:28 GMT

Source coding is the canonical problem of data compression in information theory. In locally encodable source coding, each compressed bit depends on only a few bits of the input.

artificial intelligence, machine learning, natural language, (17 more...)

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.70)
(2 more...)

Sharifi, Mohammadreza, Ahmadzadeh, Danial

Transformer-Gather, Fuzzy-Reconsider: A Scalable Hybrid Framework for Entity Resolution

arXiv.org Artificial Intelligence, Oct-27-2025

Entity resolution plays a significant role in enterprise systems where data integrity must be rigorously maintained. Traditional methods often struggle with noisy data or semantic understanding, while modern methods suffer from high computational costs or an excessive need for parallel computation. In this study, we introduce a scalable hybrid framework designed to address several important problems, including scalability, noise robustness, and reliability of results. We use a pre-trained language model to encode each structured record into a corresponding semantic embedding vector. Then, after retrieving a semantically relevant subset of candidates, we apply a syntactic verification stage that uses fuzzy string matching to refine classification on the unlabeled data. This approach was applied to a real-world entity resolution task, which exposed a linkage between a central user management database and numerous shared hosting server records. Compared with other methods, this approach exhibits outstanding performance in both processing time and robustness, making it a reliable solution for a server-side product. Crucially, this efficiency does not compromise results, as the system maintains a high retrieval recall of approximately 0.97. The scalability of the framework makes it deployable on standard CPU-based infrastructure, offering a practical and effective solution for enterprise-level data integrity auditing.
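The two-stage idea in the abstract above (semantic retrieval of candidates, then fuzzy string verification) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the character n-gram `embed` function is a toy stand-in for a pre-trained language model, and the function names, `k`, and `threshold` are all hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def embed(text: str, n: int = 3) -> Counter:
    """Toy stand-in for a language-model embedding: character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gather(query: str, records: list[str], k: int = 3) -> list[str]:
    """Stage 1 (assumed): keep the k records most similar to the query."""
    q = embed(query)
    ranked = sorted(records, key=lambda r: cosine(q, embed(r)), reverse=True)
    return ranked[:k]


def reconsider(query: str, candidates: list[str], threshold: float = 0.8) -> list[str]:
    """Stage 2 (assumed): keep candidates that pass a fuzzy string check."""
    return [
        c for c in candidates
        if SequenceMatcher(None, query.lower(), c.lower()).ratio() >= threshold
    ]


def resolve(query: str, records: list[str], k: int = 3, threshold: float = 0.8) -> list[str]:
    """Run retrieval first, then verification on the retrieved subset only."""
    return reconsider(query, gather(query, records, k), threshold)


# Hypothetical records, invented for illustration.
records = [
    "john.smith@example.com",
    "jon.smith@example.com",
    "alice.wong@example.com",
    "bob.lee@example.org",
]
matches = resolve("john.smith@example.com", records)
print(matches)
```

The design point the sketch tries to capture is that the cheaper similarity search narrows the candidate set, so the pairwise string comparison only runs on a handful of records per query.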

information retrieval, machine learning, natural language, (18 more...)

2509.1747

Country: Asia (0.14)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

arXiv.org Artificial Intelligence, Oct-17-2025

Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation

Zheng, Yilun, Yang, Dan, Li, Jie, Shang, Lin, Chen, Lihui, Xu, Jiahao, Luan, Sitao

Retrieval-Augmented Generation (RAG) systems give large language models (LLMs) instant access to relevant information during generation, which helps address common LLM challenges such as hallucination, factual inaccuracy, and the knowledge cutoff. Graph-based RAG further extends this paradigm by incorporating knowledge graphs (KGs) to leverage rich, structured connections for more precise and inferential responses. A critical challenge, however, is that most Graph-based RAG systems rely on LLMs for automated KG construction, often yielding noisy KGs with redundant entities and unreliable relationships. This noise degrades retrieval and generation performance while also increasing computational cost. Crucially, current research does not comprehensively address the denoising problem for LLM-generated KGs. In this paper, we introduce DEnoised knowledge Graphs for Retrieval Augmented Generation (DEG-RAG), a framework that addresses these challenges through: (1) entity resolution, which eliminates redundant entities, and (2) triple reflection, which removes erroneous relations. Together, these techniques yield more compact, higher-quality KGs that significantly outperform their unprocessed counterparts. Beyond the methods, we conduct a systematic evaluation of entity resolution for LLM-generated KGs, examining different blocking strategies, embedding choices, similarity metrics, and entity merging techniques. To the best of our knowledge, this is the first comprehensive exploration of entity resolution in LLM-generated KGs. Our experiments demonstrate that this straightforward approach not only drastically reduces graph size but also consistently improves question answering performance across diverse popular Graph-based RAG variants.
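The entity-resolution step described above (blocking, a similarity metric, and entity merging) can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than the DEG-RAG method: the first-letter blocking rule, the `difflib` string similarity used in place of embeddings, the threshold, and the example triples are all hypothetical. The triple reflection step is not sketched.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(name: str) -> str:
    """Hypothetical blocking rule: first character of the normalized name."""
    return name.strip().lower()[:1]


def resolve_entities(entities: set[str], threshold: float = 0.85) -> dict[str, str]:
    """Map each entity name to a canonical representative within its block."""
    blocks: dict[str, list[str]] = defaultdict(list)
    for entity in entities:
        blocks[block_key(entity)].append(entity)

    canonical: dict[str, str] = {}
    for members in blocks.values():
        representatives: list[str] = []
        for entity in sorted(members):  # sorted for deterministic merging
            for rep in representatives:
                score = SequenceMatcher(None, entity.lower(), rep.lower()).ratio()
                if score >= threshold:
                    canonical[entity] = rep  # merge into an existing representative
                    break
            else:
                representatives.append(entity)
                canonical[entity] = entity
    return canonical


def denoise(triples: list[tuple[str, str, str]], threshold: float = 0.85):
    """Rewrite triples onto canonical entities and drop exact duplicates."""
    entities = {h for h, _, _ in triples} | {t for _, _, t in triples}
    canon = resolve_entities(entities, threshold)
    return sorted({(canon[h], r, canon[t]) for h, r, t in triples})


# Hypothetical noisy triples, invented for illustration.
triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("marie curie", "won", "Nobel Prize"),
    ("Marie Curie", "won", "Nobel Prize"),
    ("Warsaw", "capital_of", "Poland"),
]
clean = denoise(triples)
print(clean)
```

Blocking keeps the number of pairwise comparisons small by comparing names only within a block; the trade-off is that duplicates falling into different blocks are never compared.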

entity resolution, large language model, machine learning, (14 more...)

2510.14271

Country:

North America (0.67)
Asia > Middle East > UAE (0.28)

Genre: Research Report (0.64)

Industry: Law (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Sarkar, Sujoy, Sarkar, Gourav, Jagadeeshan, Manoj Balaji, Sandhan, Jivnesh, Krishna, Amrith, Goyal, Pawan

Mahānāma: A Unique Testbed for Literary Entity Discovery and Linking

arXiv.org Artificial Intelligence, Sep-25-2025

High lexical variation, ambiguous references, and long-range dependencies make entity resolution in literary texts particularly challenging. We present Mahānāma, the first large-scale dataset for end-to-end Entity Discovery and Linking (EDL) in Sanskrit, a morphologically rich and under-resourced language. Derived from the Mahābhārata, the world's longest epic, the dataset comprises over 109K named entity mentions mapped to 5.5K unique entities, and is aligned with an English knowledge base to support cross-lingual linking. The complex narrative structure of Mahānāma, coupled with extensive name variation and ambiguity, poses significant challenges to resolution systems. Our evaluation reveals that current coreference and entity linking models struggle when evaluated on the global context of the test set. These results highlight the limitations of current approaches in resolving entities within such complex discourse. Mahānāma thus provides a unique benchmark for advancing entity resolution, especially in literary domains.

computational linguistic, information retrieval, natural language, (18 more...)

2509.19844

Country:

North America > United States (1.00)
Europe (1.00)
Asia > India > West Bengal (0.14)

Genre: Research Report (0.82)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.89)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.70)

Neural Information Processing Systems, Aug-19-2025, 23:36:47 GMT

b0ba5c44aaf65f6ca34cf116e6d82ebf-AuthorFeedback.pdf

algorithm, entity resolution, kwikcluster, (12 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.51)
Information Technology > Artificial Intelligence > Natural Language (0.36)

arXiv.org Artificial Intelligence, Aug-7-2025

A Robust and Efficient Pipeline for Enterprise-Level Large-Scale Entity Resolution

Kannangara, Sandeepa, Abrahamyan, Arman, Elias, Daniel, Kilby, Thomas, Dar, Nadav, Pizzato, Luiz, Leontjeva, Anna, Jermyn, Dan

Entity resolution (ER) remains a significant challenge in data management, especially when dealing with large datasets. This paper introduces MERAI (Massive Entity Resolution using AI), a robust and efficient pipeline designed to address record deduplication and linkage issues in high-volume datasets at an enterprise level. The pipeline's resilience and accuracy have been validated through various large-scale record deduplication and linkage projects. To evaluate MERAI's performance, we compared it with two well-known entity resolution libraries, Dedupe and Splink. While Dedupe failed to scale beyond 2 million records due to memory constraints, MERAI successfully processed datasets of up to 15.7 million records and produced accurate results across all experiments. Experimental data demonstrates that MERAI outperforms both baseline systems in terms of matching accuracy, with consistently higher F1 scores in both deduplication and record linkage tasks. MERAI offers a scalable and reliable solution for enterprise-level large-scale entity resolution, ensuring data integrity and consistency in real-world applications.

information retrieval, machine learning, natural language, (16 more...)

2508.03767

Country: Oceania > Australia (0.14)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)