 entity recognition model


TriNER: A Series of Named Entity Recognition Models For Hindi, Bengali & Marathi

arXiv.org Artificial Intelligence

India's rich cultural and linguistic diversity poses various challenges in the domain of Natural Language Processing (NLP), particularly in Named Entity Recognition (NER). NER is an NLP task that aims to identify and classify tokens into different entity groups like Person, Location, Organization, Number, etc. This makes NER very useful for downstream tasks like context-aware anonymization. This paper details our work to build a multilingual NER model for the three most spoken languages in India: Hindi, Bengali & Marathi. We train a custom transformer model and fine-tune a few pretrained models, achieving an F1 score of 92.11 for a total of 6 entity groups. Through this paper, we aim to introduce a single model to perform NER and significantly reduce the inconsistencies in entity groups and tag names across the three languages.
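To make the task definition above concrete, here is a minimal sketch of how token-level labels can be grouped into entities. It is an illustration only, not the TriNER pipeline: the sentence, the label sequence, and the `group_entities` helper are all hypothetical, and the label names simply reuse the entity groups named in the abstract.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens that share the same non-"O" label into one entity."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None
    for token, label in zip(tokens, labels):
        # Extend the running entity when the label repeats and is not "O".
        if label == current_label and label != "O":
            current_tokens.append(token)
            continue
        # Otherwise close the previous entity (if any) and start a new run.
        if current_label not in (None, "O"):
            entities.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [token], label
    if current_label not in (None, "O"):
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Hypothetical example; "O" marks tokens outside any entity group.
tokens = ["Sachin", "Tendulkar", "was", "born", "in", "Mumbai"]
labels = ["Person", "Person", "O", "O", "O", "Location"]
print(group_entities(tokens, labels))
```

This simple scheme cannot separate two adjacent entities of the same group; tagging schemes with explicit begin/inside markers are commonly used for that, but the abstract does not say which scheme the authors use.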


Deep Learning Based Named Entity Recognition Models for Recipes

arXiv.org Artificial Intelligence

Recipes are cultural capsules transmitted across generations via unstructured text. Automated protocols for recognizing named entities, the building blocks of recipe text, are of immense value for various applications ranging from information extraction to novel recipe generation. Named entity recognition is a technique for extracting information from unstructured or semi-structured data with known labels. Starting with manually-annotated data of 6,611 ingredient phrases, we created an augmented dataset of 26,445 phrases cumulatively. Simultaneously, we systematically cleaned and analyzed ingredient phrases from RecipeDB, the gold-standard recipe data repository, and annotated them using the Stanford NER. Based on the analysis, we sampled a subset of 88,526 phrases using a clustering-based approach while preserving the diversity to create the machine-annotated dataset. A thorough investigation of NER approaches on these three datasets, involving statistical methods, fine-tuning of deep learning-based language models, and few-shot prompting on large language models (LLMs), provides deep insights. We conclude that few-shot prompting on LLMs has abysmal performance, whereas the fine-tuned spaCy-transformer emerges as the best model with macro-F1 scores of 95.9%, 96.04%, and 95.71% for the manually-annotated, augmented, and machine-annotated datasets, respectively.
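For readers unfamiliar with the macro-F1 metric reported above, the following is a minimal sketch of its standard definition: per-label F1 scores averaged with equal weight. It is computed here at the token level, which is an assumption; the abstract does not state the authors' exact evaluation protocol. The `macro_f1` helper and the labels `NAME`, `QUANTITY`, and `UNIT` are hypothetical.

```python
from typing import List


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    f1_scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels for a four-token ingredient phrase.
gold = ["QUANTITY", "UNIT", "NAME", "NAME"]
pred = ["QUANTITY", "UNIT", "NAME", "UNIT"]
print(round(macro_f1(gold, pred), 4))
```

Because every label contributes equally to the average, a rare label that is predicted poorly lowers macro-F1 as much as a frequent one would.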


Distilling Named Entity Recognition Models for Endangered Species from Large Language Models

arXiv.org Artificial Intelligence

Natural language processing (NLP) practitioners are leveraging large language models (LLMs) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without having domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and, through in-context learning, distilled knowledge from GPT-4. In effect, we created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 for four classes of endangered species, and 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. Our novel dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. Because GPT-4 is resource intensive, the constructed dataset was then used to fine-tune both general BERT and domain-specific BERT variants, completing the knowledge distillation process from GPT-4 to BERT. Experiments show that our knowledge transfer approach is effective at creating an NER model suitable for detecting endangered species from texts.


LinkNER: Linking Local Named Entity Recognition Models to Large Language Models using Uncertainty

arXiv.org Artificial Intelligence

Named Entity Recognition (NER) serves as a fundamental task in natural language understanding, bearing direct implications for web content analysis, search engines, and information retrieval systems. Fine-tuned NER models exhibit satisfactory performance on standard NER benchmarks. However, due to limited fine-tuning data and lack of knowledge, they perform poorly at recognizing unseen entities. As a result, the usability and reliability of NER models in web-related applications are compromised. In contrast, Large Language Models (LLMs) like GPT-4 possess extensive external knowledge, but research indicates that they lack specialization for NER tasks. Furthermore, non-public and large-scale weights make tuning LLMs difficult. To address these challenges, we propose a framework that combines small fine-tuned models with LLMs (LinkNER) and an uncertainty-based linking strategy called RDC that enables fine-tuned models to complement black-box LLMs, achieving better performance. We experiment with both standard NER test sets and noisy social media datasets. LinkNER enhances NER task performance, notably surpassing SOTA models in robustness tests. We also quantitatively analyze the influence of key components like uncertainty estimation methods, LLMs, and in-context learning on diverse NER tasks, offering specific web-related recommendations.
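The general idea of uncertainty-based linking can be sketched as follows: a small local model keeps its own prediction when it is confident and defers to an LLM otherwise. This is not the RDC strategy itself, whose details the abstract does not give. Shannon entropy is used here as a stand-in uncertainty measure, and `link_prediction`, `llm_classify`, the threshold value, and the probability values are all hypothetical.

```python
import math
from typing import Callable, Dict, Tuple


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a label distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def link_prediction(
    span: str,
    local_probs: Dict[str, float],
    llm_classify: Callable[[str], str],
    threshold: float = 0.5,
) -> Tuple[str, str]:
    """Return (label, source): keep the local label when entropy is low,
    otherwise ask the LLM (assumed to be a black-box callable)."""
    if entropy(local_probs) <= threshold:
        return max(local_probs, key=local_probs.get), "local"
    return llm_classify(span), "llm"


def stub_llm(span: str) -> str:
    """Stand-in for a real LLM call; always answers "ORG" for illustration."""
    return "ORG"


confident = {"PER": 0.97, "LOC": 0.02, "ORG": 0.01}
uncertain = {"PER": 0.40, "LOC": 0.35, "ORG": 0.25}
print(link_prediction("Ada Lovelace", confident, stub_llm))
print(link_prediction("Jaguar", uncertain, stub_llm))
```

In a sketch like this the threshold trades accuracy on hard spans against the cost of LLM calls, so it would need to be tuned on held-out data.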


Training a Named Entity Recognition Model Without Data

#artificialintelligence

Named Entity Recognition (NER) is the task of recognizing entity names, such as person names, locations, and organizations, within a text. This task serves as a fundamental module for various NLP applications including chatbots, search engines, and translation systems. We can find NER datasets for generic entities easily, but obtaining data for specific domains can be challenging. Labeling NER data is more difficult than simple text classification, which makes it hard to create large-scale domain-specific NER datasets. In this post, I will demonstrate how to train an NER model without any labeled data.


Researchers claim bias in AI named entity recognition models

#artificialintelligence

Twitter researchers claim to have found evidence of demographic bias in named entity recognition, the first step toward generating automated knowledge bases, or the repositories leveraged by services like search engines. They say their analysis reveals AI performs better at identifying names from specific groups and that the biases manifest in syntax, semantics, and how word uses vary across linguistic contexts. Knowledge bases are essentially databases containing information about entities -- people, places, and things. In 2012, Google launched a knowledge base -- the Knowledge Graph -- to enhance Google search results with hundreds of billions of facts gathered from sources including Wikipedia, Wikidata, and the CIA World Factbook. Microsoft provides a knowledge base with over 150,000 articles created by support professionals who've resolved issues for its customers. But while the usefulness of knowledge bases is not in dispute, the researchers assert the embeddings used to represent entities in them exhibit bias against certain groups of people.