distant supervision
Learning from Both Structural and Textual Knowledge for Inductive Knowledge Graph Completion
In this paper, we propose a two-stage framework that incorporates both structural and textual knowledge to learn rule-based systems. In the first stage, we compute a set of triples with confidence scores (called soft triples) from a text corpus by distant supervision, where a textual entailment model with multi-instance learning estimates whether a given triple is entailed by a set of sentences. In the second stage, these soft triples are used to learn a rule-based model for knowledge graph completion (KGC).
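The multi-instance aggregation step can be sketched as follows. This is a minimal illustration, not the paper's actual estimator: it assumes each sentence mentioning a candidate triple has already been scored by an entailment model, and pools the per-sentence probabilities with noisy-OR (the triple holds if at least one sentence entails it).

```python
from typing import List

def soft_triple_score(sentence_probs: List[float]) -> float:
    """Pool per-sentence entailment probabilities for one candidate
    triple into a single confidence score via noisy-OR aggregation."""
    prob_none_entails = 1.0
    for p in sentence_probs:
        prob_none_entails *= (1.0 - p)
    return 1.0 - prob_none_entails

# Two sentences, each entailing the triple with probability 0.5:
print(soft_triple_score([0.5, 0.5]))  # 0.75
```

Triples whose pooled score clears a threshold would then serve as soft training data for the rule learner in the second stage.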
Language-Independent Sentiment Labelling with Distant Supervision: A Case Study for English, Sepedi and Setswana
Mabokela, Koena Ronny, Schlippe, Tim, Raborife, Mpho, Celik, Turgay
Sentiment analysis is a helpful task to automatically analyse opinions and emotions on various topics in areas such as AI for Social Good, AI in Education, or marketing. While most sentiment analysis systems are developed for English, many African languages are classified as low-resource languages due to the lack of digital language resources like text labelled with corresponding sentiment classes. One reason is that manually labelling text data is time-consuming and expensive. Consequently, automatic and rapid processes are needed to reduce the manual effort and make the labelling process as efficient as possible. In this paper, we present and analyze an automatic language-independent sentiment labelling method that leverages information from sentiment-bearing emojis and words. Our experiments are conducted with tweets in English, Sepedi and Setswana from SAfriSenti, a multilingual sentiment corpus for South African languages. We show that our sentiment labelling approach labels the English tweets with an accuracy of 66%, the Sepedi tweets with 69%, and the Setswana tweets with 63%, so that on average only 34% of the automatically generated labels remain to be corrected.
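The core idea, distant labels from sentiment-bearing emojis and words, can be sketched as a simple lexicon vote. The mini-lexicons below are hypothetical placeholders, not SAfriSenti's actual resources; language independence comes from swapping in a per-language word list while the emoji list stays shared.

```python
# Hypothetical mini-lexicons (illustrative only). Emojis are
# language-independent; the word lists would be per-language.
POSITIVE = {"😊", "good", "lekker"}
NEGATIVE = {"😡", "😢", "bad"}

def distant_label(tweet: str) -> str:
    """Assign a sentiment label by counting sentiment-bearing
    emojis and words; ties and no-signal tweets fall to neutral."""
    tokens = tweet.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(distant_label("what a 😊 day"))  # positive
```

Labels produced this way are then manually corrected only where needed, which is the roughly 34% residual the abstract reports.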
Culture Matters in Toxic Language Detection in Persian
Bokaei, Zahra, Magdy, Walid, Webber, Bonnie
Toxic language detection is crucial for creating safer online environments and limiting the spread of harmful content. While toxic language detection has been under-explored in Persian, the current work compares different methods for this task, including fine-tuning, data enrichment, zero-shot and few-shot learning, and cross-lingual transfer learning. What is especially compelling is the impact of cultural context on transfer learning for this task: we show that the language of a country with cultural similarities to Persian yields better results in transfer learning. Conversely, the improvement is lower when the language comes from a culturally distinct country. Warning: this paper contains examples of toxic language that may disturb some readers. These examples are included for the purpose of research on toxic language detection.
Towards DS-NER: Unveiling and Addressing Latent Noise in Distant Annotations
Ding, Yuyang, Qiao, Dan, Li, Juntao, Xu, Jiajie, Chao, Pingfu, Zhou, Xiaofang, Zhang, Min
Distantly supervised named entity recognition (DS-NER) has emerged as a cheap and convenient alternative to traditional human annotation methods, enabling the automatic generation of training data by aligning text with external resources. Despite the many efforts in noise measurement methods, few works focus on the latent noise distribution between different distant annotation methods. In this work, we explore the effectiveness and robustness of DS-NER along two axes: (1) distant annotation techniques, which encompass both traditional rule-based methods and the innovative large language model supervision approach, and (2) noise assessment, for which we introduce a novel framework. This framework distinctly categorizes the challenges into the unlabeled-entity problem (UEP) and the noisy-entity problem (NEP), and provides specialized solutions for each. Our proposed method achieves significant improvements on eight real-world distant supervision datasets originating from three different data sources and involving four distinct annotation techniques, confirming its superiority over current state-of-the-art methods. Index Terms: distantly supervised learning, named entity recognition, noise measurement.
With the prosperous development of neural techniques [3]-[5], the past decade has witnessed the tremendous success of NER tasks. To achieve high performance, massive high-quality data is indispensable, whether for previous fully supervised methods or recent fine-tuned task-specific large language models like UniNER [6]. However, obtaining massive data with high-quality annotations is either inapplicable or unaffordable. Thus, NER under distant supervision (DS) has become a popular alternative [7], [8]. Distantly supervised NER first annotates an unlabeled dataset using external resources, such as knowledge bases and dictionaries, then trains a model on the distantly annotated data.
Recently, LLMs have been demonstrated to be proficient annotators for numerous NLP tasks [9]. However, regardless of the distantly supervised method employed, whether traditional rule-based annotation methods like KB-Matching [8] and Dict-Matching [10] or LLM-based annotation methods, considerable label noise is injected into the datasets. Consequently, devising a strategy to train a high-performance NER model on a noisy NER dataset becomes critically important. We begin with a preliminary study to compare and assess the annotation capabilities of different annotation methods. Yuyang Ding and Dan Qiao contribute equally. Xiaofang Zhou is with the Hong Kong University of Science and Technology, Hong Kong, China.
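The Dict-Matching style of distant annotation can be sketched as greedy longest-match against an entity dictionary; the function below is an illustrative reconstruction, not the cited implementation. It also makes the noise sources visible: entities absent from the dictionary stay O (the unlabeled-entity problem), while wrong dictionary types propagate into the labels (the noisy-entity problem).

```python
def dict_match_bio(tokens, dictionary):
    """Greedy longest-match distant annotation producing BIO tags.
    `dictionary` maps entity surface strings to entity types."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest span starting at i first.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in dictionary:
                etype = dictionary[span]
                labels[i] = "B-" + etype
                for k in range(i + 1, j):
                    labels[k] = "I-" + etype
                i = j
                matched = True
                break
        if not matched:
            i += 1  # no entity starts here; leave O
    return labels

print(dict_match_bio("Hong Kong University".split(), {"Hong Kong": "LOC"}))
# ['B-LOC', 'I-LOC', 'O']
```

Here "University" is silently left O, a concrete instance of the UEP noise the framework above targets.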
Learning from Relevant Subgoals in Successful Dialogs using Iterative Training for Task-oriented Dialog Systems
Kaiser, Magdalena, Ernst, Patrick, Szarvas, György
Task-oriented Dialog (ToD) systems have to solve multiple subgoals to accomplish user goals, whereas feedback is often obtained only at the end of the dialog. In this work, we propose SUIT (SUbgoal-aware ITerative Training), an iterative training approach for improving ToD systems. We sample dialogs from the model we aim to improve and determine subgoals that contribute to dialog success using distant supervision to obtain high quality training samples. We show how this data improves supervised fine-tuning or, alternatively, preference learning results. SUIT is able to iteratively generate more data instead of relying on fixed static sets. SUIT reaches new state-of-the-art performance on a popular ToD benchmark.
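The distant-supervision step, crediting individual subgoals from end-of-dialog feedback only, can be illustrated with a toy sketch. The turn/subgoal schema below is a hypothetical simplification, not SUIT's actual data format: from a successful dialog, it keeps only turns whose subgoal was actually fulfilled at dialog end, yielding training samples without per-turn human labels.

```python
def relevant_subgoal_turns(turns, fulfilled_subgoals):
    """From a successful dialog, keep turns whose subgoal appears
    among the subgoals fulfilled at dialog end (distant supervision:
    only end-of-dialog feedback is available, never per-turn labels)."""
    return [t for t in turns if t["subgoal"] in fulfilled_subgoals]

dialog = [
    {"subgoal": "find_hotel", "text": "I need a cheap hotel in the centre."},
    {"subgoal": "book_taxi", "text": "Also a taxi at 9am, please."},
]
# Only the hotel subgoal was fulfilled by dialog end:
kept = relevant_subgoal_turns(dialog, {"find_hotel"})
print([t["subgoal"] for t in kept])  # ['find_hotel']
```

The kept turns would then feed supervised fine-tuning or preference learning, and the loop repeats with the improved model.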
Debiased and Denoised Entity Recognition from Distant Supervision
While distant supervision has been extensively explored and exploited in NLP tasks like named entity recognition, a major obstacle stems from the inevitably noisy distant labels, which are produced without human supervision. A few past works approach this problem by adopting a self-training framework with a sample-selection mechanism. In this work, we identify two types of biases that were omitted by prior work and that lead to inferior performance in the distantly-supervised NER setup. First, we characterize the noise concealed in the distant labels as highly structural rather than fully randomized. Second, the self-training framework itself introduces an inherent bias that causes erroneous behavior in both sample selection and, eventually, prediction.
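The sample-selection mechanism this paper critiques can be sketched as naive confidence filtering; the schema and threshold below are illustrative assumptions, not the paper's method. The sketch also shows why the second bias arises: structurally noisy labels can be *confidently* wrong, so filtering by model confidence keeps them and reinforces the error in the next self-training round.

```python
def select_pseudo_labels(examples, min_confidence=0.9):
    """Naive confidence-based sample selection for self-training:
    keep (token, pseudo_label) pairs whose model confidence clears
    the threshold. Confidently-wrong labels survive this filter,
    which is the selection bias the paper highlights."""
    return [(tok, lab) for tok, lab, conf in examples if conf >= min_confidence]

examples = [
    ("Lincoln", "B-PER", 0.95),   # correct and confident: kept
    ("Lincoln", "B-LOC", 0.93),   # wrong but confident: also kept
    ("sedan",   "B-ORG", 0.40),   # wrong and unconfident: dropped
]
print(select_pseudo_labels(examples))
```

Both high-confidence pairs survive, including the mislabeled one, so the next training round inherits the structural noise.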
Augmenting Document-level Relation Extraction with Efficient Multi-Supervision
Lin, Xiangyu, Jia, Weijia, Gong, Zhiguo
Despite its popularity in sentence-level relation extraction, distantly supervised data is rarely utilized by existing work in document-level relation extraction due to its noisy nature and low information density. Among its current applications, distantly supervised data is mostly used as a whole for pretraining, which is of low time efficiency. To fill the gap of efficient and robust utilization of distantly supervised training data, we propose Efficient Multi-Supervision for document-level relation extraction, in which we first select a subset of informative documents from the massive dataset by combining distant supervision with expert supervision, then train the model with a Multi-Supervision Ranking Loss that integrates knowledge from multiple sources of supervision to alleviate the effects of noise. The experiments demonstrate the effectiveness of our method in improving model performance with higher time efficiency than existing baselines.
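A minimal sketch of a pairwise margin ranking loss in the spirit of the Multi-Supervision Ranking Loss; the abstract does not give the actual formulation, so the function below is an assumed stand-in: relations marked positive by any supervision source should outscore negatives by at least a margin.

```python
def ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hypothetical pairwise hinge ranking loss: every positive
    relation score should exceed every negative score by `margin`;
    pairs already separated by the margin contribute zero loss."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += max(0.0, margin - (p - n))
    return total / max(1, len(pos_scores) * len(neg_scores))

print(ranking_loss([2.0], [0.0]))  # 0.0  (already separated by the margin)
print(ranking_loss([0.5], [0.0]))  # 0.5  (violates the margin by 0.5)
```

A per-source weighting of the pairs (e.g. trusting expert supervision more than distant labels) would be the natural extension, but is omitted here since the abstract gives no details.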