distant supervision
Learning from Both Structural and Textual Knowledge for Inductive Knowledge Graph Completion
In this paper, we propose a two-stage framework that incorporates both structural and textual knowledge to learn rule-based systems. In the first stage, we compute a set of triples with confidence scores (called soft triples) from a text corpus by distant supervision, where a textual entailment model with multi-instance learning estimates whether a given triple is entailed by a set of sentences. In the second stage, these soft triples are used to learn a rule-based model for knowledge graph completion (KGC).
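The multi-instance aggregation step can be sketched as follows. This is a minimal illustration, not the paper's actual estimator: it assumes each sentence mentioning a candidate triple has already been scored by an entailment model, and pools the per-sentence probabilities with noisy-OR (the triple holds if at least one sentence entails it).

```python
from typing import List

def soft_triple_score(sentence_probs: List[float]) -> float:
    """Pool per-sentence entailment probabilities for one candidate
    triple into a single confidence score via noisy-OR aggregation."""
    prob_none_entails = 1.0
    for p in sentence_probs:
        prob_none_entails *= (1.0 - p)
    return 1.0 - prob_none_entails

# Two sentences, each entailing the triple with probability 0.5:
print(soft_triple_score([0.5, 0.5]))  # 0.75
```

Triples whose pooled score clears a threshold would then serve as soft training data for the rule learner in the second stage.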
Language-Independent Sentiment Labelling with Distant Supervision: A Case Study for English, Sepedi and Setswana
Mabokela, Koena Ronny, Schlippe, Tim, Raborife, Mpho, Celik, Turgay
Sentiment analysis is a helpful task to automatically analyse opinions and emotions on various topics in areas such as AI for Social Good, AI in Education, or marketing. While most sentiment analysis systems are developed for English, many African languages are classified as low-resource languages due to the lack of digital language resources like text labelled with corresponding sentiment classes. One reason is that manually labelling text data is time-consuming and expensive. Consequently, automatic and rapid processes are needed to reduce the manual effort and make the labelling process as efficient as possible. In this paper, we present and analyze an automatic language-independent sentiment labelling method that leverages information from sentiment-bearing emojis and words. Our experiments are conducted with tweets in English, Sepedi and Setswana from SAfriSenti, a multilingual sentiment corpus for South African languages. We show that our sentiment labelling approach labels the English tweets with an accuracy of 66%, the Sepedi tweets with 69%, and the Setswana tweets with 63%, so that on average only 34% of the automatically generated labels remain to be corrected.
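The core idea, distant labels from sentiment-bearing emojis and words, can be sketched as a simple lexicon vote. The mini-lexicons below are hypothetical placeholders, not SAfriSenti's actual resources; language independence comes from swapping in a per-language word list while the emoji list stays shared.

```python
# Hypothetical mini-lexicons (illustrative only). Emojis are
# language-independent; the word lists would be per-language.
POSITIVE = {"😊", "good", "lekker"}
NEGATIVE = {"😡", "😢", "bad"}

def distant_label(tweet: str) -> str:
    """Assign a sentiment label by counting sentiment-bearing
    emojis and words; ties and no-signal tweets fall to neutral."""
    tokens = tweet.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(distant_label("what a 😊 day"))  # positive
```

Labels produced this way are then manually corrected only where needed, which is the roughly 34% residual the abstract reports.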
Culture Matters in Toxic Language Detection in Persian
Bokaei, Zahra, Magdy, Walid, Webber, Bonnie
Toxic language detection is crucial for creating safer online environments and limiting the spread of harmful content. While toxic language detection has been under-explored in Persian, the current work compares different methods for this task, including fine-tuning, data enrichment, zero-shot and few-shot learning, and cross-lingual transfer learning. What is especially compelling is the impact of cultural context on transfer learning for this task: we show that the language of a country with cultural similarities to Persian yields better results in transfer learning. Conversely, the improvement is lower when the language comes from a culturally distinct country. Warning: this paper contains examples of toxic language that may disturb some readers. These examples are included for the purpose of research on toxic language detection.
Towards DS-NER: Unveiling and Addressing Latent Noise in Distant Annotations
Ding, Yuyang, Qiao, Dan, Li, Juntao, Xu, Jiajie, Chao, Pingfu, Zhou, Xiaofang, Zhang, Min
Distantly supervised named entity recognition (DS-NER) has emerged as a cheap and convenient alternative to traditional human annotation methods, enabling the automatic generation of training data by aligning text with external resources. Despite the many efforts in noise measurement methods, few works focus on the latent noise distribution between different distant annotation methods. In this work, we explore the effectiveness and robustness of DS-NER along two axes: (1) distant annotation techniques, which encompass both traditional rule-based methods and the innovative large language model supervision approach, and (2) noise assessment, for which we introduce a novel framework. This framework distinctly categorizes the challenges into the unlabeled-entity problem (UEP) and the noisy-entity problem (NEP), and provides specialized solutions for each. Our proposed method achieves significant improvements on eight real-world distant supervision datasets originating from three different data sources and involving four distinct annotation techniques, confirming its superiority over current state-of-the-art methods. Index Terms: distantly supervised learning, named entity recognition, noise measurement.
With the prosperous development of neural techniques [3]-[5], the past decade has witnessed the tremendous success of NER tasks. To achieve high performance, massive high-quality data is indispensable, whether for previous fully supervised methods or recent fine-tuned task-specific large language models like UniNER [6]. However, obtaining massive data with high-quality annotations is either inapplicable or unaffordable. Thus, NER under distant supervision (DS) has become a popular alternative [7], [8]. Distantly supervised NER first annotates an unlabeled dataset using external resources, such as knowledge bases and dictionaries, then trains a model on the distantly annotated data.
Recently, LLMs have been demonstrated to be proficient annotators for numerous NLP tasks [9]. However, regardless of the distantly supervised method employed, whether traditional rule-based annotation methods like KB-Matching [8] and Dict-Matching [10] or LLM-based annotation methods, considerable label noise is injected into the datasets. Consequently, devising a strategy to train a high-performance NER model on a noisy NER dataset becomes critically important. We begin with a preliminary study to compare and assess the annotation capabilities of different annotation methods. Yuyang Ding and Dan Qiao contribute equally. Xiaofang Zhou is with the Hong Kong University of Science and Technology, Hong Kong, China.
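The Dict-Matching style of distant annotation can be sketched as greedy longest-match against an entity dictionary; the function below is an illustrative reconstruction, not the cited implementation. It also makes the noise sources visible: entities absent from the dictionary stay O (the unlabeled-entity problem), while wrong dictionary types propagate into the labels (the noisy-entity problem).

```python
def dict_match_bio(tokens, dictionary):
    """Greedy longest-match distant annotation producing BIO tags.
    `dictionary` maps entity surface strings to entity types."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest span starting at i first.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in dictionary:
                etype = dictionary[span]
                labels[i] = "B-" + etype
                for k in range(i + 1, j):
                    labels[k] = "I-" + etype
                i = j
                matched = True
                break
        if not matched:
            i += 1  # no entity starts here; leave O
    return labels

print(dict_match_bio("Hong Kong University".split(), {"Hong Kong": "LOC"}))
# ['B-LOC', 'I-LOC', 'O']
```

Here "University" is silently left O, a concrete instance of the UEP noise the framework above targets.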
Learning from Relevant Subgoals in Successful Dialogs using Iterative Training for Task-oriented Dialog Systems
Kaiser, Magdalena, Ernst, Patrick, Szarvas, György
Task-oriented Dialog (ToD) systems have to solve multiple subgoals to accomplish user goals, whereas feedback is often obtained only at the end of the dialog. In this work, we propose SUIT (SUbgoal-aware ITerative Training), an iterative training approach for improving ToD systems. We sample dialogs from the model we aim to improve and determine subgoals that contribute to dialog success using distant supervision to obtain high quality training samples. We show how this data improves supervised fine-tuning or, alternatively, preference learning results. SUIT is able to iteratively generate more data instead of relying on fixed static sets. SUIT reaches new state-of-the-art performance on a popular ToD benchmark.
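The distant-supervision step, crediting individual subgoals from end-of-dialog feedback only, can be illustrated with a toy sketch. The turn/subgoal schema below is a hypothetical simplification, not SUIT's actual data format: from a successful dialog, it keeps only turns whose subgoal was actually fulfilled at dialog end, yielding training samples without per-turn human labels.

```python
def relevant_subgoal_turns(turns, fulfilled_subgoals):
    """From a successful dialog, keep turns whose subgoal appears
    among the subgoals fulfilled at dialog end (distant supervision:
    only end-of-dialog feedback is available, never per-turn labels)."""
    return [t for t in turns if t["subgoal"] in fulfilled_subgoals]

dialog = [
    {"subgoal": "find_hotel", "text": "I need a cheap hotel in the centre."},
    {"subgoal": "book_taxi", "text": "Also a taxi at 9am, please."},
]
# Only the hotel subgoal was fulfilled by dialog end:
kept = relevant_subgoal_turns(dialog, {"find_hotel"})
print([t["subgoal"] for t in kept])  # ['find_hotel']
```

The kept turns would then feed supervised fine-tuning or preference learning, and the loop repeats with the improved model.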
Debiased and Denoised Entity Recognition from Distant Supervision
While distant supervision has been extensively explored and exploited in NLP tasks like named entity recognition, a major obstacle stems from the inevitably noisy distant labels, which are produced without human supervision. A few past works approach this problem by adopting a self-training framework with a sample-selection mechanism. In this work, we identify two types of biases that were omitted by prior work and that lead to inferior performance in the distantly-supervised NER setup. First, we characterize the noise concealed in the distant labels as highly structural rather than fully randomized. Second, the self-training framework itself introduces an inherent bias that causes erroneous behavior in both sample selection and, eventually, prediction.
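The sample-selection mechanism this paper critiques can be sketched as naive confidence filtering; the schema and threshold below are illustrative assumptions, not the paper's method. The sketch also shows why the second bias arises: structurally noisy labels can be *confidently* wrong, so filtering by model confidence keeps them and reinforces the error in the next self-training round.

```python
def select_pseudo_labels(examples, min_confidence=0.9):
    """Naive confidence-based sample selection for self-training:
    keep (token, pseudo_label) pairs whose model confidence clears
    the threshold. Confidently-wrong labels survive this filter,
    which is the selection bias the paper highlights."""
    return [(tok, lab) for tok, lab, conf in examples if conf >= min_confidence]

examples = [
    ("Lincoln", "B-PER", 0.95),   # correct and confident: kept
    ("Lincoln", "B-LOC", 0.93),   # wrong but confident: also kept
    ("sedan",   "B-ORG", 0.40),   # wrong and unconfident: dropped
]
print(select_pseudo_labels(examples))
```

Both high-confidence pairs survive, including the mislabeled one, so the next training round inherits the structural noise.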
Augmenting Document-level Relation Extraction with Efficient Multi-Supervision
Lin, Xiangyu, Jia, Weijia, Gong, Zhiguo
Despite its popularity in sentence-level relation extraction, distantly supervised data is rarely utilized by existing work in document-level relation extraction due to its noisy nature and low information density. Among its current applications, distantly supervised data is mostly used as a whole for pretraining, which is of low time efficiency. To fill the gap of efficient and robust utilization of distantly supervised training data, we propose Efficient Multi-Supervision for document-level relation extraction, in which we first select a subset of informative documents from the massive dataset by combining distant supervision with expert supervision, then train the model with a Multi-Supervision Ranking Loss that integrates knowledge from multiple sources of supervision to alleviate the effects of noise. The experiments demonstrate the effectiveness of our method in improving model performance with higher time efficiency than existing baselines.
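A minimal sketch of a pairwise margin ranking loss in the spirit of the Multi-Supervision Ranking Loss; the abstract does not give the actual formulation, so the function below is an assumed stand-in: relations marked positive by any supervision source should outscore negatives by at least a margin.

```python
def ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hypothetical pairwise hinge ranking loss: every positive
    relation score should exceed every negative score by `margin`;
    pairs already separated by the margin contribute zero loss."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += max(0.0, margin - (p - n))
    return total / max(1, len(pos_scores) * len(neg_scores))

print(ranking_loss([2.0], [0.0]))  # 0.0  (already separated by the margin)
print(ranking_loss([0.5], [0.0]))  # 0.5  (violates the margin by 0.5)
```

A per-source weighting of the pairs (e.g. trusting expert supervision more than distant labels) would be the natural extension, but is omitted here since the abstract gives no details.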