Towards DS-NER: Unveiling and Addressing Latent Noise in Distant Annotations
Yuyang Ding, Dan Qiao, Juntao Li, Jiajie Xu, Pingfu Chao, Xiaofang Zhou, Min Zhang
–arXiv.org Artificial Intelligence
Abstract--Distantly supervised named entity recognition (DS-NER) has emerged as a cheap and convenient alternative to traditional human annotation, enabling the automatic generation of training data by aligning text with external resources. Despite many efforts in noise measurement, few works focus on the latent noise distribution across different distant annotation methods. In this work, we explore the effectiveness and robustness of DS-NER from two aspects: (1) distant annotation techniques, encompassing both traditional rule-based methods and the emerging large language model supervision approach, and (2) noise assessment, for which we introduce a novel framework. This framework addresses the challenges by distinctly categorizing them into the unlabeled-entity problem (UEP) and the noisy-entity problem (NEP), then providing specialized solutions for each. Our proposed method achieves significant improvements on eight real-world distant supervision datasets originating from three different data sources and involving four distinct annotation techniques, confirming its superiority over current state-of-the-art methods.

Index Terms--Distantly supervised learning, Named entity recognition, Noise measurement.

With the prosperous development of neural techniques [3]-[5], the past decade has witnessed the tremendous success of NER tasks. To achieve high performance, massive high-quality data is indispensable, whether for earlier fully supervised methods or for recent fine-tuned Task-Specific Large Language Models such as UniNER [6]. However, obtaining massive data with high-quality annotations is either inapplicable or unaffordable. Thus, NER under distant supervision (DS) has become a popular alternative [7], [8]. Distantly supervised NER first annotates an unlabeled dataset using external resources, such as knowledge bases and dictionaries, and then trains a model on the distantly annotated data.
Recently, LLMs have been demonstrated to be proficient annotators for numerous NLP tasks [9]. However, regardless of the distant supervision method employed, whether traditional rule-based annotation methods such as KB-Matching [8] and Dict-Matching [10] or LLM-based annotation methods, considerable label noise is injected into the datasets. Consequently, devising a strategy to train a high-performance NER model on a noisy NER dataset becomes critically important. We begin with a preliminary study to compare and assess the annotation capabilities of different annotation methods.

Yuyang Ding and Dan Qiao contribute equally. Xiaofang Zhou is with the Hong Kong University of Science and Technology, Hong Kong, China.
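To make the distant-annotation setting concrete, the following is a minimal sketch of dictionary-based distant annotation (in the spirit of Dict-Matching): spans of text that appear in an external dictionary are greedily labeled with the dictionary's entity type, and everything else is left as O. The dictionary, sentence, and function name here are illustrative assumptions, not taken from the paper; the deliberately incomplete and ambiguous dictionary shows how both noise types arise.

```python
def distant_annotate(tokens, dictionary, max_span=4):
    """Greedy longest-match labeling: token spans found in the dictionary
    receive B-/I- tags of the dictionary's type; all other tokens get O."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest span first so longer entity names take priority.
        for length in range(min(max_span, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + length])
            if span in dictionary:
                ent_type = dictionary[span]
                labels[i] = f"B-{ent_type}"
                for j in range(i + 1, i + length):
                    labels[j] = f"I-{ent_type}"
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return labels

# Illustrative dictionary: incomplete (missing entities -> UEP) and
# context-blind ("Washington" always tagged PER even when it is a place -> NEP).
dictionary = {"Hong Kong": "LOC", "Washington": "PER"}
tokens = "Washington announced new rules in Hong Kong yesterday".split()
print(distant_annotate(tokens, dictionary))
# -> ['B-PER', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O']
```

Entities absent from the dictionary silently stay O (the unlabeled-entity problem), while type-ambiguous matches such as "Washington" can receive the wrong label (the noisy-entity problem); the framework described in the abstract treats these two noise sources separately.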
May-20-2025