corenlp
Fusing Knowledge and Language: A Comparative Study of Knowledge Graph-Based Question Answering with LLMs
Chaudhary, Vaibhav, Soni, Neha, Singh, Narotam, Kapoor, Amita
Knowledge graphs, a powerful tool for structuring information through relational triplets, have recently become the new front-runner in enhancing question-answering systems. While traditional Retrieval Augmented Generation (RAG) approaches are proficient in fact-based and local context-based extraction from concise texts, they encounter limitations when addressing the thematic and holistic understanding of complex, extensive texts, requiring a deeper analysis of both text and context. This paper presents a comprehensive technical comparative study of three different methodologies for constructing knowledge graph triplets and integrating them with Large Language Models (LLMs) for question answering: spaCy, Stanford CoreNLP-OpenIE, and GraphRAG, all leveraging open source technologies. We evaluate the effectiveness, feasibility, and adaptability of these methods by analyzing their capabilities, state of development, and their impact on the performance of LLM-based question answering. Experimental results indicate that while OpenIE provides the most comprehensive coverage of triplets, GraphRAG demonstrates superior reasoning abilities among the three. We conclude with a discussion on the strengths and limitations of each method and provide insights into future directions for improving knowledge graph-based question answering.
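To make the idea of relational triplets feeding an LLM concrete, here is a minimal sketch. It is not the pipeline used in the paper: the toy triplets, the substring-based entity matching, and the function names `build_index`, `retrieve`, and `build_prompt` are all illustrative assumptions. It shows only how (subject, relation, object) facts could be looked up for a question and placed into a prompt.

```python
from collections import defaultdict

# Hypothetical toy triplets (subject, relation, object); not from the paper.
TRIPLETS = [
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "born in", "Warsaw"),
    ("Warsaw", "capital of", "Poland"),
]


def build_index(triplets):
    """Map each lowercased entity to the triplets it appears in."""
    index = defaultdict(list)
    for subj, rel, obj in triplets:
        index[subj.lower()].append((subj, rel, obj))
        index[obj.lower()].append((subj, rel, obj))
    return index


def retrieve(question, index):
    """Return triplets whose subject or object is mentioned in the question."""
    q = question.lower()
    seen, hits = set(), []
    for entity, facts in index.items():
        if entity in q:  # naive substring match, an assumption of this sketch
            for fact in facts:
                if fact not in seen:
                    seen.add(fact)
                    hits.append(fact)
    return hits


def build_prompt(question, facts):
    """Format retrieved facts as context ahead of the question."""
    lines = [f"{s} {r} {o}." for s, r, o in facts]
    return "Facts:\n" + "\n".join(lines) + f"\n\nQuestion: {question}\nAnswer:"


index = build_index(TRIPLETS)
question = "Where was Marie Curie born?"
facts = retrieve(question, index)
prompt = build_prompt(question, facts)
```

A real system would replace the substring match with entity linking and would extract the triplets automatically, for example with one of the three toolchains the study compares.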
Do "English" Named Entity Recognizers Work Well on Global Englishes?
Shan, Alexander, Bauer, John, Carlson, Riley, Manning, Christopher
The vast majority of the popular English named entity recognition (NER) datasets contain American or British English data, despite the existence of many global varieties of English. As such, it is unclear whether they generalize for analyzing use of English globally. To test this, we build a newswire dataset, the Worldwide English NER Dataset, to analyze NER model performance on low-resource English variants from around the world. We test widely used NER toolkits and transformer models, including models using the pre-trained contextual models RoBERTa and ELECTRA, on three datasets: a commonly used British English newswire dataset, CoNLL 2003, a more American focused dataset, OntoNotes, and our global dataset. All models trained on the CoNLL or OntoNotes datasets experienced significant performance drops (over 10 F1 in some cases) when tested on the Worldwide English dataset. Upon examination of region-specific errors, we observe the greatest performance drops for Oceania and Africa, while Asia and the Middle East had comparatively strong performance. Lastly, we find that a combined model trained on the Worldwide dataset and either CoNLL or OntoNotes lost only 1-2 F1 on both test sets.
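For readers unfamiliar with the F1 figures quoted above, the following sketch computes entity-level F1 from gold and predicted spans. The exact-match scoring rule, the span representation `(start, end, label)`, and the toy data are assumptions made for illustration; the paper's evaluation script may differ in detail.

```python
def entity_f1(gold, predicted):
    """Entity-level F1: a prediction counts only if span and label both match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans as (start_token, end_token, label).
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
pred = {(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")}  # one label is wrong

score = entity_f1(gold, pred)
```

Scores are usually reported multiplied by 100, so a "10 F1" drop corresponds to a difference of 0.10 in the value returned here.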
Information Extraction in Domain and Generic Documents: Findings from Heuristic-based and Data-driven Approaches
Information extraction (IE) plays a very important role in natural language processing (NLP) and is fundamental to many NLP applications that extract structured information from unstructured text data. Heuristic-based searching and data-driven learning are the two mainstream implementation approaches. However, not much attention has been paid to the influence of document genre and length on IE tasks. To fill the gap, in this study, we investigated the accuracy and generalization abilities of heuristic-based searching and data-driven learning on two IE tasks: named entity recognition (NER) and semantic role labeling (SRL), on domain-specific and generic documents of different lengths. We posited two hypotheses: first, short documents may yield better accuracy results compared to long documents; second, generic documents may exhibit superior extraction outcomes relative to domain-dependent documents due to training document genre limitations. Our findings reveal that no single method demonstrated overwhelming performance in both tasks. For named entity extraction, data-driven approaches outperformed symbolic methods in terms of accuracy, particularly in short texts. In the case of semantic role extraction, we observed that the heuristic-based searching method and the data-driven model with syntax representation surpassed the performance of the pure data-driven approach, which only considers semantic information. Additionally, we discovered that different semantic roles exhibited varying accuracy levels with the same method. This study offers valuable insights for downstream text mining tasks, such as NER and SRL, when addressing various document features and genres.
Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction
Heid, Stefan, Wever, Marcel, Hüllermeier, Eyke
Syntactic annotation of corpora in the form of part-of-speech (POS) tags is a key requirement for both linguistic research and subsequent automated natural language processing (NLP) tasks. This problem is commonly tackled using machine learning methods, i.e., by training a POS tagger on a sufficiently large corpus of labeled data. While the problem of POS tagging can essentially be considered as solved for modern languages, historical corpora turn out to be much more difficult, especially due to the lack of native speakers and sparsity of training data. Moreover, most texts have no sentences as we know them today, nor a common orthography. These irregularities render the task of automated POS tagging more difficult and error-prone. Under these circumstances, instead of forcing the POS tagger to predict and commit to a single tag, it should be enabled to express its uncertainty. In this paper, we consider POS tagging within the framework of set-valued prediction, which allows the POS tagger to express its uncertainty via predicting a set of candidate POS tags instead of guessing a single one. The goal is to guarantee a high confidence that the correct POS tag is included while keeping the number of candidates small. In our experimental study, we find that extending state-of-the-art POS taggers to set-valued prediction yields more precise and robust taggings, especially for unknown words, i.e., words not occurring in the training data.
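One simple way to picture set-valued prediction is to return the smallest group of tags whose combined probability reaches a chosen confidence level. The sketch below does exactly that; it is an illustrative assumption, not necessarily the set construction used in the paper, and the function name `prediction_set`, the threshold, and the probabilities are hypothetical. Any coverage guarantee also depends on the tagger's probabilities being well calibrated.

```python
def prediction_set(tag_probs, confidence=0.9):
    """Return the most probable tags until their total mass reaches `confidence`."""
    ranked = sorted(tag_probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, mass = [], 0.0
    for tag, prob in ranked:
        chosen.append(tag)
        mass += prob
        if mass >= confidence:
            break
    return chosen


# An ambiguous (e.g. unknown) word: the tagger hedges between two tags.
ambiguous = prediction_set({"NOUN": 0.55, "VERB": 0.40, "ADJ": 0.05})

# A confident prediction collapses to a single tag.
confident = prediction_set({"NOUN": 0.97, "VERB": 0.03})
```

Raising `confidence` makes it more likely that the correct tag is included but enlarges the sets, which is the trade-off the abstract describes.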
8 great Python libraries for natural language processing
Natural language processing, or NLP for short, is best described as "AI for speech and text." The magic behind voice commands, speech and text translation, sentiment analysis, text summarization, and many other linguistic applications and analyses, natural language processing has been improved dramatically through deep learning. The Python language provides a convenient front-end to all varieties of machine learning including NLP. In fact, there is an embarrassment of NLP riches to choose from in the Python ecosystem. In this article we'll explore each of the NLP libraries available for Python--their use cases, their strengths, their weaknesses, and their general level of popularity.
Introduction to StanfordNLP with Python Implementation
A common challenge I came across while learning Natural Language Processing (NLP) – can we build models for non-English languages? The answer has been no for quite a long time. Each language has its own grammatical patterns and linguistic nuances. I could barely contain my excitement when I read the news last week. The authors claimed StanfordNLP could support more than 53 human languages!
StanfordNLP
StanfordNLP is the combination of the software package used by the Stanford team in the CoNLL 2018 Shared Task on Universal Dependency Parsing, and the group's official Python interface to the Stanford CoreNLP software. Aside from the functions it inherits from CoreNLP, it contains tools to convert a string of text to lists of sentences and words, generate base forms of those words, their parts of speech and morphological features, and a syntactic structure that is designed to be parallel among more than 70 languages. This package is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data. The modules are built on top of PyTorch. To see StanfordNLP's neural pipeline in action, you can launch the Python interactive interpreter, and try the following commands. At the end, you should be able to see the dependency parse of the first sentence in the example.
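The commands themselves did not survive in this excerpt. As a rough sketch of what such a session looks like, the function below wraps the package's documented entry points (`stanfordnlp.download`, `stanfordnlp.Pipeline`, and `print_dependencies` on a sentence). The wrapper name `print_first_dependency_parse` is hypothetical, and the sketch assumes the `stanfordnlp` package is installed and that English models can be downloaded.

```python
def print_first_dependency_parse(text, lang="en"):
    """Run the StanfordNLP neural pipeline and print the first sentence's parse."""
    # Imported lazily so this file loads even where the package is not installed.
    import stanfordnlp

    # Fetches the models for `lang`; this may prompt for confirmation
    # and only needs to succeed once per machine.
    stanfordnlp.download(lang)

    nlp = stanfordnlp.Pipeline(lang=lang)  # builds the default neural pipeline
    doc = nlp(text)
    doc.sentences[0].print_dependencies()  # one line per word with head and relation
    return doc
```

Calling `print_first_dependency_parse("Barack Obama was born in Hawaii. He was elected president in 2008.")` should print the dependency relations for the first sentence only.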
How to recognize a lowercase named entity such as kobe bryant with CoreNLP?
First off, you do have to accept that it is harder to get named entities right in lowercase or inconsistently cased English text than in formal text, where capital letters are a great clue. Nevertheless, there are things that you must do to get CoreNLP working fairly well with lowercase text – the default models are trained to work well on well-edited text. If you are working with properly edited text, you should use our default English models. If the text that you are working with is (mainly) lowercase or uppercase, then you should use one of the two solutions presented below. If it's a real mixture (like much social media text), you might use the truecaser solution below, or you might gain by using both the cased and caseless NER models (as a long list of models given to the ner.model property).
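The last option, passing both cased and caseless models to the `ner.model` property, can be sketched as a small properties builder. The two model paths are assumptions based on the names shipped in the English models jar and should be checked against your CoreNLP version; the helper names `ner_properties` and `to_properties_text` are hypothetical.

```python
# Assumed classpath locations of the English 3-class NER models; verify them
# against the models jar of the CoreNLP release you are using.
CASED_NER = "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz"
CASELESS_NER = (
    "edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz"
)


def ner_properties(models):
    """Build CoreNLP properties that run NER with the given list of models."""
    return {
        "annotators": "tokenize,ssplit,pos,lemma,ner",
        # ner.model takes a comma-separated list; models are applied in order.
        "ner.model": ",".join(models),
    }


def to_properties_text(props):
    """Render the properties in Java .properties syntax."""
    return "\n".join(f"{key} = {value}" for key, value in props.items())


props = ner_properties([CASED_NER, CASELESS_NER])
properties_text = to_properties_text(props)
```

The resulting text can be saved to a file and passed to CoreNLP with `-props`, or the dictionary can be sent as the `properties` parameter of a CoreNLP server request.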
Stanford.NLP.CoreNLP - Stanford.NLP.NET
Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. Stanford CoreNLP is an integrated framework, which makes it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools on it with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.
Natural Language Processing with Stanford CoreNLP - Cloud Academy
Today, we'll be following up on our recent post on the Google Cloud Natural Language API. In this post, we're going to take a second look at the service and compare it to the Stanford CoreNLP, a well-known suite for Natural Language Processing (NLP). We will walk you through how to get started using the Stanford CoreNLP, and then we'll discuss the strengths and weaknesses of the two solutions. Artificial intelligence and machine learning are some of the hottest topics in IT. The major cloud platforms--Amazon Web Services, Google Cloud Platform, and Microsoft Azure--are increasingly exposing a variety of these functions in a way that makes it easy for developers to integrate them into their apps.