named-entity recognition


Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data

Micallef, Kurt, Habash, Nizar, Borg, Claudia

arXiv.org Artificial Intelligence

Maltese is a unique Semitic language that has evolved under extensive influence from Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, creating a gap between it and its closest linguistic relatives in Arabic. In this paper, we explore whether Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual augmentation techniques. We investigate multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation (MT) approaches. As part of this, we also introduce novel transliteration systems that better represent Maltese orthography. We evaluate the impact of these augmentations on monolingual and multilingual models and demonstrate that Arabic-based augmentation can significantly benefit Maltese NLP tasks.
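
As a rough illustration of what character-level transliteration toward Maltese orthography involves, here is a toy Python sketch mapping a few Arabic letters to conventional Maltese counterparts (e.g., ش to x, ح to ħ). The mapping and example are purely illustrative; the transliteration systems introduced in the paper are necessarily much richer, handling vowels, digraphs, and ambiguity.

    # Toy character-level Arabic-to-Latin transliteration (illustrative only;
    # not the paper's transliteration systems).
    AR_TO_MT = {
        "ب": "b", "ت": "t", "د": "d", "ر": "r", "س": "s",
        "ش": "x",  # Maltese <x> denotes /ʃ/, like Arabic <ش>
        "ح": "ħ",  # Maltese <ħ> corresponds to Arabic <ح>
        "ق": "q", "م": "m", "ل": "l", "ن": "n", "ك": "k",
    }

    def transliterate(arabic_text: str) -> str:
        """Map each Arabic character to a Latin (Maltese-style) character,
        leaving unmapped characters unchanged."""
        return "".join(AR_TO_MT.get(ch, ch) for ch in arabic_text)

    print(transliterate("كتب"))  # -> "ktb" (cf. Maltese "kiteb"; vowels need a richer model)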


PBa-LLM: Privacy- and Bias-aware NLP using Named-Entity Recognition (NER)

Mancera, Gonzalo, Morales, Aythami, Fierrez, Julian, Tolosana, Ruben, Peña, Alejandro, Lopez-Duran, Miguel, Jurado, Francisco, Ortigosa, Alvaro

arXiv.org Artificial Intelligence

The use of Natural Language Processing (NLP) in high-stakes AI-based applications has increased significantly in recent years, especially since the emergence of Large Language Models (LLMs). However, despite their strong performance, LLMs introduce important legal and ethical concerns, particularly regarding privacy, data protection, and transparency. Due to these concerns, this work explores the use of Named-Entity Recognition (NER) to facilitate the privacy-preserving training (or adaptation) of LLMs. We propose a framework that uses NER technologies to anonymize sensitive information in text data, such as personal identities or geographic locations. An evaluation of the proposed privacy-preserving learning framework was conducted to measure its impact on user privacy and system performance in a particular high-stakes and sensitive setup: AI-based resume scoring for recruitment processes. The study involved two language models (BERT and RoBERTa) and six anonymization algorithms (based on Presidio, FLAIR, BERT, and different versions of GPT) applied to a database of 24,000 candidate profiles. The findings indicate that the proposed privacy-preservation techniques effectively maintain system performance while playing a critical role in safeguarding candidate confidentiality, thus promoting trust in the evaluated scenario. On top of the proposed privacy-preserving approach, we also experiment with applying an existing approach that reduces gender bias in LLMs, finally obtaining our proposed Privacy- and Bias-aware LLMs (PBa-LLMs). Note that the proposed PBa-LLMs have been evaluated in a particular setup (resume scoring) but are generally applicable to any other LLM-based AI application.
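
To make the anonymization step concrete, here is a minimal sketch using Microsoft Presidio, one of the anonymization engines the paper evaluates. The snippet and the example text are illustrative and do not reproduce the authors' exact configuration.

    # Minimal NER-based anonymization sketch with Microsoft Presidio.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    resume_snippet = "Jane Doe, based in Madrid, led NLP projects at Acme Corp."

    # Detect sensitive entities such as PERSON and LOCATION ...
    results = analyzer.analyze(text=resume_snippet, language="en")

    # ... and replace each detected span with a placeholder like <PERSON>.
    anonymized = anonymizer.anonymize(text=resume_snippet, analyzer_results=results)
    print(anonymized.text)  # e.g., "<PERSON>, based in <LOCATION>, led NLP projects at Acme Corp."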


LLMs as Data Annotators: How Close Are We to Human Performance?

Haq, Muhammad Uzair Ul, Rigoni, Davide, Sperduti, Alessandro

arXiv.org Artificial Intelligence

In NLP, fine-tuning LLMs is effective for various applications but requires high-quality annotated data. However, manual annotation of data is labor-intensive, time-consuming, and costly. Therefore, LLMs are increasingly used to automate the process, often employing in-context learning (ICL), in which some examples related to the task are given in the prompt for better performance. However, manually selecting context examples can lead to inefficiencies and suboptimal model performance. This paper presents comprehensive experiments comparing several LLMs, considering different embedding models, across various datasets for the Named Entity Recognition (NER) task. The evaluation encompasses models with approximately 7B and 70B parameters, including both proprietary and non-proprietary models. Furthermore, leveraging the success of Retrieval-Augmented Generation (RAG), it also considers a method that addresses the limitations of ICL by automatically retrieving contextual examples, thereby enhancing performance. The results highlight the importance of selecting the appropriate LLM and embedding model, understanding the trade-offs between LLM sizes and desired performance, and the necessity of directing research efforts towards more challenging datasets.
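
A minimal sketch of the retrieval idea, assuming a sentence-transformers embedder and a small hypothetical annotated pool: embed the pool once, then select the nearest neighbors of each new sentence as in-context examples. The model name, pool, and prompt format are placeholders, not the paper's setup.

    # Retrieval-based selection of in-context examples for LLM annotation.
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # Hypothetical annotated pool: (sentence, NER annotation) pairs.
    pool = [
        ("Barack Obama visited Paris.", "[Barack Obama|PER] visited [Paris|LOC]."),
        ("Apple acquired a startup.", "[Apple|ORG] acquired a startup."),
    ]
    pool_emb = embedder.encode([s for s, _ in pool], convert_to_tensor=True)

    def build_prompt(query: str, k: int = 1) -> str:
        """Retrieve the k most similar annotated sentences and prepend them."""
        q_emb = embedder.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(q_emb, pool_emb, top_k=k)[0]
        examples = "\n".join(
            f"Input: {pool[h['corpus_id']][0]}\nOutput: {pool[h['corpus_id']][1]}"
            for h in hits
        )
        return f"{examples}\nInput: {query}\nOutput:"

    print(build_prompt("Google opened an office in Berlin."))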


Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks

Yildirim, Savas

arXiv.org Artificial Intelligence

Deep learning-based and, more recently, Transformer-based language models have dominated natural language processing research in recent years. Thanks to their accurate and fast fine-tuning characteristics, they have outperformed traditional machine learning-based approaches and achieved state-of-the-art results on many challenging natural language understanding (NLU) problems. Recent studies have shown that Transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers) reach impressive results on many tasks. Moreover, thanks to their transfer learning capacity, these architectures allow pre-built models to be transferred and fine-tuned to specific NLU tasks such as question answering. In this study, we provide a Transformer-based model and a baseline benchmark for the Turkish language. We successfully fine-tuned a Turkish BERT model, namely BERTurk (trained with base settings), on several downstream tasks and evaluated it on a Turkish benchmark dataset. We show that our models significantly outperform existing baseline approaches for Named-Entity Recognition, Sentiment Analysis, Question Answering, and Text Classification in Turkish. We publicly release these four fine-tuned models and resources for reproducibility and to support other Turkish researchers and applications.
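
As a sketch of what such fine-tuning looks like with Hugging Face Transformers, the following loads the publicly released BERTurk base checkpoint for token classification and runs one toy training step. The checkpoint name is the public BERTurk model; the example sentence, labels, and hyperparameters are placeholders rather than the paper's setup.

    # One toy fine-tuning step of BERTurk for token classification.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    checkpoint = "dbmdz/bert-base-turkish-cased"  # BERTurk, base settings
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=3)

    enc = tokenizer("Ali Ankara'ya gitti", return_tensors="pt")
    labels = torch.zeros_like(enc["input_ids"])  # toy labels: every token tagged 0 ("O")

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    loss = model(**enc, labels=labels).loss  # cross-entropy over per-token labels
    loss.backward()
    optimizer.step()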


CLaCLab at SocialDisNER: Using Medical Gazetteers for Named-Entity Recognition of Disease Mentions in Spanish Tweets

Verma, Harsh, Bagherzadeh, Parsa, Bergler, Sabine

arXiv.org Artificial Intelligence

This paper summarizes the CLaC submission for SMM4H 2022 Task 10, which concerns the recognition of diseases mentioned in Spanish tweets. The simplicity of this pipeline, and its injection of knowledge from readily available domain resources rather than training purely from training data, are our system's strengths.
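
A toy sketch of gazetteer-based knowledge injection for disease-mention tagging: mark any token that matches an entry in a medical gazetteer. The entries and label name are hypothetical; a real system such as the one summarized above combines such lookups with a trained tagger.

    # Gazetteer lookup as a weak NER signal (illustrative entries and label).
    GAZETTEER = {"diabetes", "covid", "gripe", "asma"}

    def gazetteer_tags(tokens):
        """Return an IOB-style tag per token based on exact gazetteer matches."""
        return ["B-ENFERMEDAD" if tok.lower() in GAZETTEER else "O" for tok in tokens]

    print(gazetteer_tags("Mi abuela tiene diabetes y asma".split()))
    # -> ['O', 'O', 'O', 'B-ENFERMEDAD', 'O', 'B-ENFERMEDAD']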


GitHub - chakki-works/seqeval: A Python framework for sequence labeling evaluation (named-entity recognition, POS tagging, etc.)

#artificialintelligence

This is well-tested against the Perl script conlleval, which can be used to measure the performance of a system that has processed the CoNLL-2000 shared task data. The default mode is compatible with conlleval; if you want to use it, you don't need to specify anything. In strict mode, the inputs are evaluated according to the specified scheme; the behavior of strict mode differs from the default mode, which is designed to simulate conlleval. If you want to use strict mode, specify both the mode='strict' and scheme arguments at the same time, as in the sketch below:
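
A short usage sketch based on the description above (this is seqeval's actual API; the toy tag sequences are made up):

    from seqeval.metrics import f1_score
    from seqeval.scheme import IOB2

    y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
    y_pred = [["B-PER", "I-PER", "O", "B-LOC"]]

    print(f1_score(y_true, y_pred))  # default mode, conlleval-compatible
    print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2))  # strict mode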


Zero-Shot Learning in Named-Entity Recognition with External Knowledge

Van Hoang, Nguyen, Mulvad, Soeren Hougaard, Rong, Dexter Neo Yuan, Yue, Yang

arXiv.org Artificial Intelligence

A significant shortcoming of current state-of-the-art (SOTA) named-entity recognition (NER) systems is their lack of generalization to unseen domains, which poses a major problem since obtaining labeled data for NER in a new domain is expensive and time-consuming. We propose ZERO, a model that performs zero-shot and few-shot learning in NER to generalize to unseen domains by incorporating pre-existing knowledge in the form of semantic word embeddings. ZERO first obtains contextualized word representations of input sentences using the model LUKE, reduces their dimensionality, and compares them directly with the embeddings of the external knowledge, allowing ZERO to be trained to recognize unseen output entities. We find that ZERO performs well on unseen NER domains with an average macro F1 score of 0.23, outperforms LUKE in few-shot learning, and even achieves competitive scores on an in-domain comparison. The performance across source-target domain pairs is shown to be inversely correlated with the pairs' KL divergence.
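
Conceptually, the prediction step reduces to a nearest-neighbor search in embedding space, as in the following sketch. The random vectors stand in for LUKE's dimensionality-reduced token representations and for the external semantic word embeddings; they are not the model's actual components.

    # Nearest-label prediction against external label embeddings (stand-in vectors).
    import numpy as np

    rng = np.random.default_rng(0)
    token_repr = rng.normal(size=300)  # stand-in for a reduced contextualized representation
    label_embeddings = {               # stand-in for external semantic embeddings
        "disease": rng.normal(size=300),
        "chemical": rng.normal(size=300),
        "O": rng.normal(size=300),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    pred = max(label_embeddings, key=lambda lbl: cosine(token_repr, label_embeddings[lbl]))
    print(pred)  # nearest label in embedding space, even if unseen during training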


Fine-Tuning Transformers for NLP

#artificialintelligence

You can see a complete working example in our Colab Notebook, and you can play with the trained models on HuggingFace. Since first being developed and released in the Attention Is All You Need paper, Transformers have completely redefined the field of Natural Language Processing (NLP), setting the state of the art on numerous tasks such as question answering, language generation, and named-entity recognition. Here we won't go into too much detail about what a Transformer is, but rather how to apply and train one to help achieve some task at hand. The main things to keep in mind conceptually about Transformers are that they are very good at dealing with sequential data (text, speech, etc.), they act as an encoder-decoder framework in which data is mapped to a representational space by the encoder before being mapped to the output by the decoder, and they scale incredibly well to parallel processing hardware (GPUs). Transformers in NLP have been trained on massive amounts of text data, which allows them to understand both the syntax and semantics of a language very well.
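
For a sense of how little code applying a pretrained Transformer takes, here is a minimal NER example with the Hugging Face pipeline API (illustrative; the post's Colab notebook covers full fine-tuning):

    # Apply a pretrained Transformer to NER in a few lines.
    from transformers import pipeline

    ner = pipeline("ner", aggregation_strategy="simple")  # downloads a default NER model
    print(ner("Hugging Face is based in New York City."))
    # -> entity groups such as ORG for "Hugging Face" and LOC for "New York City"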


Transfer Learning for Named-Entity Recognition with Neural Networks

Lee, Ji Young, Dernoncourt, Franck, Szolovits, Peter

arXiv.org Machine Learning

Recent approaches based on artificial neural networks (ANNs) have shown promising results for named-entity recognition (NER). In order to achieve high performance, ANNs need to be trained on a large labeled dataset. However, labels might be difficult to obtain for the dataset on which the user wants to perform NER: label scarcity is particularly pronounced for patient note de-identification, which is an instance of NER. In this work, we analyze to what extent transfer learning may address this issue. In particular, we demonstrate that transferring an ANN model trained on a large labeled dataset to another dataset with a limited number of labels improves upon the state-of-the-art results on two different datasets for patient note de-identification.
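
A generic PyTorch sketch of the transfer recipe the paper analyzes: reuse the lower layers of a source-trained tagger and retrain the task-specific output layer on the small target dataset. The architecture and sizes here are illustrative, not the paper's model.

    # Transfer all layers except the label-specific output head.
    import torch.nn as nn

    class TaggerSketch(nn.Module):
        def __init__(self, vocab=5000, emb=64, hidden=128, n_labels=5):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_labels)  # label space is task-specific

        def forward(self, x):
            h, _ = self.lstm(self.embed(x))
            return self.out(h)

    source = TaggerSketch(n_labels=9)  # imagine this trained on a large labeled corpus
    target = TaggerSketch(n_labels=5)  # small target task with a different label set
    state = {k: v for k, v in source.state_dict().items() if not k.startswith("out.")}
    target.load_state_dict(state, strict=False)  # transfer embeddings + LSTM, keep a fresh head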


NeuroNER: an easy-to-use program for named-entity recognition based on neural networks

Dernoncourt, Franck, Lee, Ji Young, Szolovits, Peter

arXiv.org Machine Learning

Named-entity recognition (NER) aims at identifying entities of interest in a text. Artificial neural networks (ANNs) have recently been shown to outperform existing NER systems. However, ANNs remain challenging to use for non-expert users. In this paper, we present NeuroNER, an easy-to-use named-entity recognition tool based on ANNs. Users can annotate entities using a graphical web-based user interface (BRAT); the annotations are then used to train an ANN, which in turn predicts entities' locations and categories in new texts. NeuroNER makes this annotation-training-prediction flow smooth and accessible to anyone.
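
The flow starts from BRAT's standoff annotations. As an illustration of that first step (generic parsing code, not NeuroNER's own), entity lines in a BRAT .ann file look like "T1<TAB>Disease 10 18<TAB>diabetes" and can be read as follows:

    # Parse text-bound entity annotations from a BRAT standoff (.ann) file.
    # Contiguous spans only; discontinuous spans ("10 18;20 25") are skipped here.
    def parse_brat_ann(path):
        entities = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("T") and ";" not in line.split("\t")[1]:
                    tid, type_span, surface = line.rstrip("\n").split("\t")
                    etype, start, end = type_span.split()
                    entities.append((tid, etype, int(start), int(end), surface))
        return entities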