 spacy


UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment

Imperial, Joseph Marvin, Barayan, Abdullah, Stodden, Regina, Wilkens, Rodrigo, Sanchez, Ricardo Munoz, Gao, Lingyun, Torgbi, Melissa, Knight, Dawn, Forey, Gail, Jablonkai, Reka R., Kochmar, Ekaterina, Reynolds, Robert, Ribeiro, Eugénio, Saggion, Horacio, Volodina, Elena, Vajjala, Sowmya, François, Thomas, Alva-Manchego, Fernando, Madabushi, Harish Tayyar

arXiv.org Artificial Intelligence

We introduce UniversalCEFR, a large-scale multilingual and multidimensional dataset of texts annotated with CEFR (Common European Framework of Reference) levels in 13 languages. To enable open research in automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modelling across tasks and languages. To demonstrate its utility, we conduct benchmarking experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results support using linguistic features and fine-tuning pre-trained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution for language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.


Fusing Knowledge and Language: A Comparative Study of Knowledge Graph-Based Question Answering with LLMs

Chaudhary, Vaibhav, Soni, Neha, Singh, Narotam, Kapoor, Amita

arXiv.org Artificial Intelligence

Knowledge graphs, a powerful tool for structuring information through relational triplets, have recently become the new front-runner in enhancing question-answering systems. While traditional Retrieval Augmented Generation (RAG) approaches are proficient in fact-based and local context-based extraction from concise texts, they encounter limitations when addressing the thematic and holistic understanding of complex, extensive texts, requiring a deeper analysis of both text and context. This paper presents a comprehensive technical comparative study of three different methodologies for constructing knowledge graph triplets and integrating them with Large Language Models (LLMs) for question answering: spaCy, Stanford CoreNLP-OpenIE, and GraphRAG, all leveraging open source technologies. We evaluate the effectiveness, feasibility, and adaptability of these methods by analyzing their capabilities, state of development, and their impact on the performance of LLM-based question answering. Experimental results indicate that while OpenIE provides the most comprehensive coverage of triplets, GraphRAG demonstrates superior reasoning abilities among the three. We conclude with a discussion on the strengths and limitations of each method and provide insights into future directions for improving knowledge graph-based question answering.
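As a rough illustration of what "relational triplets" feeding a question-answering step might look like, the sketch below stores (subject, relation, object) triplets in a small index and retrieves the facts attached to one entity. The example triplets and the helper names `build_index` and `facts_about` are hypothetical and are not taken from the paper; the extraction step itself (spaCy, OpenIE, or GraphRAG) is assumed to have already produced the triplets.

```python
from collections import defaultdict

# Hypothetical triplets, as an extractor might emit them.
TRIPLES = [
    ("Marie Curie", "won", "Nobel Prize"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
]


def build_index(triples):
    """Group (relation, object) pairs by their subject entity."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[subject].append((relation, obj))
    return index


def facts_about(index, entity):
    """Return one plain-text fact per triplet whose subject is `entity`."""
    return [f"{entity} {relation} {obj}" for relation, obj in index.get(entity, [])]


index = build_index(TRIPLES)
context = facts_about(index, "Marie Curie")
```

The retrieved facts could then be placed in an LLM prompt as context; how the compared systems actually do this is not specified in the abstract.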


BrightCookies at SemEval-2025 Task 9: Exploring Data Augmentation for Food Hazard Classification

Papadopoulou, Foteini, Mutlu, Osman, Özen, Neris, van der Velden, Bas H. M., Hendrickx, Iris, Hürriyetoğlu, Ali

arXiv.org Artificial Intelligence

This paper presents our system developed for the SemEval-2025 Task 9: The Food Hazard Detection Challenge. The shared task's objective is to evaluate explainable classification systems for classifying hazards and products in two levels of granularity from food recall incident reports. In this work, we propose text augmentation techniques as a way to improve poor performance on minority classes and compare their effect for each category on various transformer and machine learning models. We explore three word-level data augmentation techniques, namely synonym replacement, random word swapping, and contextual word insertion. The results show that transformer models tend to have a better overall performance. None of the three augmentation techniques consistently improved overall performance for classifying hazards and products. We observed a statistically significant improvement (P < 0.05) in the fine-grained categories when comparing the BERT baseline with each augmented model. Compared to the baseline, the contextual word insertion augmentation improved the accuracy of predictions for the minority hazard classes by 6%. This suggests that targeted augmentation of minority classes can improve the performance of transformer models.
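Of the three word-level techniques named above, random word swapping is the simplest to sketch. The following is a minimal illustration using only the Python standard library; the function name `random_swap`, its parameters, and the whitespace tokenization are assumptions made for this example and do not reflect the authors' implementation.

```python
import random
from typing import Optional


def random_swap(text: str, n_swaps: int = 1, seed: Optional[int] = None) -> str:
    """Return `text` with `n_swaps` randomly chosen pairs of words exchanged."""
    rng = random.Random(seed)  # local generator, so a seed gives repeatable output
    words = text.split()       # naive whitespace tokenization (assumption)
    if len(words) < 2:
        return text            # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


augmented = random_swap("salmonella found in frozen chicken products", n_swaps=2, seed=0)
```

The augmented sentence contains exactly the same words as the input, only in a different order, so the class label can be reused unchanged for the new training example.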


NER4all or Context is All You Need: Using LLMs for low-effort, high-performance NER on historical texts. A humanities informed approach

Hiltmann, Torsten, Dröge, Martin, Dresselhaus, Nicole, Grallert, Till, Althage, Melanie, Bayer, Paul, Eckenstaler, Sophie, Mendi, Koray, Schmitz, Jascha Marijn, Schneider, Philipp, Sczeponik, Wiebke, Skibba, Anica

arXiv.org Artificial Intelligence

Named entity recognition (NER) is a core task for historical research in automatically establishing all references to people, places, events and the like. Yet, due to the high linguistic and genre diversity of sources, only limited canonisation of spellings, the level of required historical domain knowledge, and the scarcity of annotated training data, established approaches to natural language processing (NLP) have been both extremely expensive and yielded only unsatisfactory results in terms of recall and precision. Our paper introduces a new approach. We demonstrate how readily available, state-of-the-art LLMs significantly outperform two leading NLP frameworks, spaCy and flair, for NER in historical documents, achieving F1 scores that are seven to twenty-two percent higher. Our ablation study shows how providing historical context to the task and a bit of persona modelling that turns focus away from a purely linguistic approach are core to a successful prompting strategy. We also demonstrate that, contrary to our expectations, providing increasing numbers of examples in few-shot approaches does not improve recall or precision below a threshold of 16-shot. In consequence, our approach democratises access to NER for all historians by removing the barrier of scripting languages and computational skills required for established NLP tools and instead leveraging natural language prompts and consumer-grade tools and frontends.


Comparative Performance of Advanced NLP Models and LLMs in Multilingual Geo-Entity Detection

Kopanov, Kalin

arXiv.org Artificial Intelligence

The integration of advanced Natural Language Processing (NLP) methodologies and Large Language Models (LLMs) has significantly enhanced the extraction and analysis of geospatial data from multilingual texts, impacting sectors such as national and international security. This paper presents a comprehensive evaluation of leading NLP models -- SpaCy, XLM-RoBERTa, mLUKE, GeoLM -- and LLMs, specifically OpenAI's GPT 3.5 and GPT 4, within the context of multilingual geo-entity detection. Utilizing datasets from Telegram channels in English, Russian, and Arabic, we examine the performance of these models through metrics such as accuracy, precision, recall, and F1 scores, to assess their effectiveness in accurately identifying geospatial references. The analysis exposes each model's distinct advantages and challenges, underscoring the complexities involved in achieving precise geo-entity identification across varied linguistic landscapes. The conclusions drawn from this experiment aim to direct the enhancement and creation of more advanced and inclusive NLP tools, thus advancing the field of geospatial analysis and its application to global security.
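For readers unfamiliar with the metrics mentioned above, the sketch below computes precision, recall, and F1 for one set of predicted geo-entities against a gold set. Exact-match scoring over (message id, entity text) pairs is an assumption made for this example; the abstract does not state the matching scheme used in the paper, and the sample entities are invented.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall, and F1 between two collections of entities."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (message id, entity text)
gold = {(0, "Kyiv"), (1, "Aleppo"), (2, "Odesa"), (3, "Donetsk")}
predicted = {(0, "Kyiv"), (2, "Odesa")}

p, r, f1 = precision_recall_f1(gold, predicted)
# Both predictions are correct (precision 1.0) but half the gold entities
# are missed (recall 0.5), so F1 is 2 * 1.0 * 0.5 / 1.5, about 0.667.
```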


Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation

Ali, Nurshat Fateh, Mohtasim, Md. Mahdi, Mosharrof, Shakil, Krishna, T. Gopi

arXiv.org Artificial Intelligence

This research presents and compares multiple approaches to automate the generation of literature reviews using several Natural Language Processing (NLP) techniques and retrieval-augmented generation (RAG) with a Large Language Model (LLM). The ever-increasing number of research articles poses a huge challenge for manual literature review and has resulted in an increased demand for automation. The primary objective of this research work is to develop a system capable of automatically generating literature reviews from only PDF files as input. To meet this objective, the effectiveness of several NLP strategies is evaluated: the frequency-based method (spaCy), the transformer model (Simple T5), and retrieval-augmented generation (RAG) with a Large Language Model (GPT-3.5-turbo). The SciTLDR dataset is chosen for this research experiment, and three distinct techniques are utilized to implement three different systems for auto-generating the literature reviews. ROUGE scores are used for the evaluation of all three systems. Based on the evaluation, the Large Language Model GPT-3.5-turbo achieved the highest ROUGE-1 score, 0.364. The transformer model comes in second place, and spaCy is in last position. Finally, a graphical user interface is created for the best system based on the large language model.
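A frequency-based extractive summarizer, the simplest of the three approaches, can be sketched roughly as follows. This version uses only the Python standard library as a stand-in for spaCy's tokenization and stop-word handling; the stop-word list, the regex sentence splitter, and the function name `summarize` are assumptions for illustration, not the paper's implementation.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; spaCy ships a much larger one).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "that", "this", "it", "are", "as"}


def _content_words(sentence):
    """Lowercased alphanumeric tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=1):
    """Keep the n_sentences whose words are most frequent across the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in _content_words(s))
    # Score each sentence by the summed frequency of its content words.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in _content_words(sentences[i])),
        reverse=True,
    )
    keep = sorted(ranked[:n_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


text = (
    "Literature reviews summarize prior research. "
    "Automated literature reviews save research time for literature surveys. "
    "Cats sleep."
)
summary = summarize(text, n_sentences=1)
```

Summing raw frequencies favours long sentences; dividing each score by sentence length is a common adjustment if that bias matters.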


Discovering Latent Structural Causal Models from Spatio-Temporal Data

Wang, Kun, Varambally, Sumanth, Watson-Parris, Duncan, Ma, Yi-An, Yu, Rose

arXiv.org Machine Learning

Many important phenomena in scientific fields such as climate, neuroscience, and epidemiology are naturally represented as spatiotemporal gridded data with complex interactions. For example, in climate science, researchers aim to uncover how large-scale events, such as the North Atlantic Oscillation (NAO) and the Antarctic Oscillation (AAO), influence other global processes. Inferring causal relationships from these data is a challenging problem compounded by the high dimensionality of such data and the correlations between spatially proximate points. We present SPACY (SPAtiotemporal Causal discoverY), a novel framework based on variational inference, designed to explicitly model latent time-series and their causal relationships from spatially confined modes in the data. Our method uses an end-to-end training process that maximizes an evidence lower bound (ELBO) for the data likelihood. Theoretically, we show that, under some conditions, the latent variables are identifiable up to transformation by an invertible matrix. Empirically, we show that SPACY outperforms state-of-the-art baselines on synthetic data, remains scalable for large grids, and identifies key known phenomena from real-world climate data.


Social Evolution of Published Text and The Emergence of Artificial Intelligence Through Large Language Models and The Problem of Toxicity and Bias

Khan, Arifa, Saravanan, P., Venkatesan, S. K

arXiv.org Artificial Intelligence

We provide a bird's-eye view of the rapid developments in AI and Deep Learning that have led to the path-breaking emergence of AI in Large Language Models. The aim of this study is to place all these developments in a pragmatic, broader historical and social perspective, without exaggeration but also without the pessimism that created the AI winter of the 1970s to 1990s. At the same time, we point out the toxicity, bias, memorization, sycophancy, logical inconsistencies, and hallucinations that exist, as a warning to the overly optimistic. We note here that just as this emergence of AI seems to occur at a threshold point in the number of neural connections or weights, it has also been observed that the human brain, and especially the cortex region, is nothing special or extraordinary but simply a scaled-up version of the primate brain, and that even human intelligence seems to be an emergent phenomenon of scale.


Augmenty: A Python Library for Structured Text Augmentation

Enevoldsen, Kenneth

arXiv.org Artificial Intelligence

Text augmentation is a useful tool for training (Wei and Zou 2019) and evaluating (Ribeiro et al. 2020) natural language processing models and systems. Despite its utility, existing libraries for text augmentation often exhibit limitations in terms of functionality and flexibility, being confined to basic tasks such as text classification or catering to specific downstream use cases such as estimating robustness (Goel et al. 2021). Recognizing these constraints, Augmenty is a tool for structured augmentation of text along with its annotations. Augmenty integrates seamlessly with the popular NLP library spaCy (Honnibal et al. 2020) and seeks to be compatible with all models and tasks supported by spaCy. Augmenty provides a wide range of augmenters which can be combined in a flexible manner to create complex augmentation pipelines. It also includes a set of primitives that can be used to create custom augmenters such as word replacement augmenters. This functionality allows for augmentations within a range of applications such as named entity recognition (NER), part-of-speech tagging, and dependency parsing.


Named entity recognition using GPT for identifying comparable companies

Covas, Eurico

arXiv.org Artificial Intelligence

For both public and private firms, comparable companies' analysis is widely used as a method for company valuation. In particular, the method is of great value for valuation of private equity companies. The several approaches to the comparable companies' method usually rely on a qualitative approach to identifying similar peer companies, which tends to use established industry classification schemes and/or analyst intuition and knowledge. However, more quantitative methods have started being used in the literature and in the private equity industry, in particular machine learning clustering and natural language processing (NLP). For NLP methods, the process consists of extracting product entities from, e.g., the company's website or company descriptions from some financial database system and then performing similarity analysis. Here, using companies' descriptions/summaries from publicly available companies' Wikipedia websites, we show that using large language models (LLMs), such as GPT from OpenAI, has a much higher precision and success rate than using the standard named entity recognition (NER) methods which use manual annotation. We demonstrate quantitatively a higher precision rate, and show that, qualitatively, it can be used to create appropriate comparable companies peer groups which could then be used for equity valuation.