AITopics | language resource and evaluation conference

BERnaT: Basque Encoders for Representing Natural Textual Diversity

Azurmendi, Ekhi, de Landa, Joseba Fernandez, Bengoetxea, Jaione, Heredia, Maite, Etxaniz, Julen, Zubillaga, Mikel, Soraluze, Ander, Soroa, Aitor

arXiv.org Artificial Intelligence, Dec-4-2025

Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.

artificial intelligence, computational linguistic, natural language, (14 more...)

arXiv.org Artificial Intelligence

2512.03903

Country:

North America > United States (0.46)
North America > Mexico (0.28)
Europe > Austria (0.28)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (0.88)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead

Alabi, Jesujoba O., Hedderich, Michael A., Adelani, David Ifeoluwa, Klakow, Dietrich

arXiv.org Artificial Intelligence, Oct-3-2025

With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors, including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 884 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.

computational linguistic, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2505.21315

Country:

Asia (1.00)
Africa (1.00)
Europe > Spain (0.67)
North America > United States > Minnesota (0.27)

Genre: Overview (1.00)

Industry:

Health & Medicine (1.00)
Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(4 more...)

The TUB Sign Language Corpus Collection

Avramidis, Eleftherios, Czehmann, Vera, Deckert, Fabian, Hufe, Lorenz, Lipski, Aljoscha, Villalobos, Yuni Amaloa Quintero, Rhee, Tae Kwon, Shi, Mengqian, Stölting, Lennart, Nunnari, Fabrizio, Möller, Sebastian

arXiv.org Artificial Intelligence, Aug-8-2025

We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3M subtitles containing 14M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, whereas the size of the German Sign Language corpora is ten times the size of the previously available corpora. The collection was created by collecting and processing videos of multiple sign languages from various online sources, mainly broadcast material of news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.

artificial intelligence, machine translation, natural language, (15 more...)

arXiv.org Artificial Intelligence

2508.05374

Country:

North America > United States (0.68)
Europe > Germany (0.53)
South America > Chile (0.46)

Genre: Research Report (0.40)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)

Lemmatization as a Classification Task: Results from Arabic across Multiple Genres

Saeed, Mostafa, Habash, Nizar

arXiv.org Artificial Intelligence, Jun-24-2025

Lemmatization is crucial for NLP tasks in morphologically rich languages with ambiguous orthography like Arabic, but existing tools face challenges due to inconsistent standards and limited genre coverage. This paper introduces two novel approaches that frame lemmatization as classification into a Lemma-POS-Gloss (LPG) tagset, leveraging machine translation and semantic clustering. We also present a new Arabic lemmatization test set covering diverse genres, standardized alongside existing datasets. We evaluate character-level sequence-to-sequence models, which perform competitively and offer complementary value, but are limited to lemma prediction (not LPG) and prone to hallucinating implausible forms. Our results show that classification and clustering yield more robust, interpretable outputs, setting new benchmarks for Arabic lemmatization.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2506.18399

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Asia > Japan > Kyūshū & Okinawa > Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
(8 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Revisiting Noise in Natural Language Processing for Computational Social Science

Borenstein, Nadav

arXiv.org Artificial Intelligence, Mar-10-2025

Computational Social Science (CSS) is an emerging field driven by the unprecedented availability of human-generated content for researchers. This field, however, presents a unique set of challenges due to the nature of the theories and datasets it explores, including highly subjective tasks and complex, unstructured textual corpora. Among these challenges, one of the less well-studied topics is the pervasive presence of noise. This thesis aims to address this gap in the literature by presenting a series of interconnected case studies that examine different manifestations of noise in CSS. These include character-level errors following the OCR processing of historical records, archaic language, inconsistencies in annotations for subjective and ambiguous tasks, and even noise and biases introduced by large language models during content generation. This thesis challenges the conventional notion that noise in CSS is inherently harmful or useless. Rather, it argues that certain forms of noise can encode meaningful information that is invaluable for advancing CSS research, such as the unique communication styles of individuals or the culture-dependent nature of datasets and tasks. Further, this thesis highlights the importance of nuance in dealing with noise and the considerations CSS researchers must address when encountering it, demonstrating that different types of noise require distinct strategies.

camembert-ft-sq-fr camembert-ft-sq-fr 54 54 52, convenient qualitative analysis and visualisation, hedonism pleasure and sensuous gratification, (16 more...)

arXiv.org Artificial Intelligence

2503.07395

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Poland (0.14)
Europe > Finland (0.14)
(130 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)
(2 more...)

Industry:

Media > News (1.00)
Leisure & Entertainment (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
(10 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(4 more...)

A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment

Elmadani, Khalid N., Habash, Nizar, Taha-Thomure, Hanada

arXiv.org Artificial Intelligence, Feb-19-2025

This paper introduces the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale, fine-grained dataset for Arabic readability assessment. BAREC consists of 68,182 sentences spanning more than 1 million words, carefully curated to cover 19 readability levels, from kindergarten to postgraduate comprehension. The corpus balances genre diversity, topical coverage, and target audiences, offering a comprehensive resource for evaluating Arabic text complexity. The corpus was fully manually annotated by a large team of annotators. The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 81.3%, reflecting a high level of agreement. Beyond presenting the corpus, we benchmark automatic readability assessment across different granularity levels, comparing a range of techniques. Our results highlight the challenges and opportunities in Arabic readability modeling, demonstrating competitive performance across various methods. To support research and education, we will make BAREC openly available, along with detailed annotation guidelines and benchmark results.
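
For readers unfamiliar with the agreement metric named in this abstract, the following is a minimal sketch of Quadratic Weighted Kappa for two annotators who label the same items on an ordinal scale. It is not the authors' code: the function name, the 0-based integer level encoding, and the toy labels are assumptions made for illustration only.

```python
from collections import Counter


def quadratic_weighted_kappa(labels_a, labels_b, num_levels):
    """Cohen's kappa with quadratic weights for two annotators.

    Labels are assumed to be integers in range(num_levels), ordered from
    the easiest level to the hardest (a hypothetical encoding).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    if num_levels < 2:
        raise ValueError("need at least two levels")

    n = len(labels_a)
    observed = Counter(zip(labels_a, labels_b))  # joint counts per (i, j)
    hist_a = Counter(labels_a)                   # marginal counts, annotator A
    hist_b = Counter(labels_b)                   # marginal counts, annotator B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Disagreement penalty grows with the squared distance in levels.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0.0:
        # Both annotators used a single identical level for every item.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Toy example with 3 levels (made-up labels, not data from the corpus).
same = quadratic_weighted_kappa([0, 1, 2], [0, 1, 2], num_levels=3)
opposite = quadratic_weighted_kappa([0, 0, 2, 2], [2, 2, 0, 0], num_levels=3)
print(same, opposite)
```

Because the penalty is quadratic in the distance between levels, a disagreement of one level costs far less than a disagreement of several levels, which suits a fine-grained ordinal scale such as the 19 readability levels described above. An average pairwise score would be obtained by computing this value for each pair of annotators and taking the mean.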

artificial intelligence, machine learning, natural language, (12 more...)

arXiv.org Artificial Intelligence

2502.1352

Country:

Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
Europe > Slovenia (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(18 more...)

Genre: Research Report > New Finding (0.88)

Industry: Education > Educational Setting > K-12 Education > Primary School (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Salamandra Technical Report

Gonzalez-Agirre, Aitor, Pàmies, Marc, Llop, Joan, Baucells, Irene, Da Dalt, Severino, Tamayo, Daniel, Saiz, José Javier, Espuña, Ferran, Prats, Jaume, Aula-Blasco, Javier, Mina, Mario, Pikabea, Iñigo, Rubio, Adrián, Shvets, Alexander, Sallés, Anna, Lacunza, Iñaki, Palomar, Jorge, Falcão, Júlia, Tormo, Lucía, Vasquez-Reina, Luis, Marimon, Montserrat, Pareras, Oriol, Ruiz-Fernández, Valle, Villegas, Marta

arXiv.org Artificial Intelligence, Feb-13-2025

This work introduces Salamandra, a suite of open-source decoder-only large language models available in three different sizes: 2, 7, and 40 billion parameters. The models were trained from scratch on highly multilingual data that comprises text in 35 European languages and code. Our carefully curated corpus is made exclusively from open-access data compiled from a wide variety of sources. Along with the base models, supplementary checkpoints that were fine-tuned on public-domain instruction data are also released for chat applications. We also share our preliminary experiments on multimodality, which serve as proof-of-concept to showcase potential applications for the Salamandra family. Our extensive evaluations on multilingual benchmarks reveal that Salamandra has strong capabilities, achieving competitive performance when compared to similarly sized open-source models. We provide comprehensive evaluation results both on standard downstream tasks and on key aspects related to bias and safety. With this technical report, we intend to promote open science by sharing all the details behind our design choices, data curation strategy and evaluation methodology. In addition, we deviate from the usual practice by making our training and evaluation scripts publicly accessible. We release all models under a permissive Apache 2.0 license in order to foster future research and facilitate commercial use, thereby contributing to the open-source ecosystem of large language models.

kideak modu eraginkorragoan aurkitzen zituzten, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2502.08489

Country:

Europe > Spain (1.00)
North America > United States (0.92)
Asia > Middle East > UAE (0.45)

Genre: Research Report > New Finding (0.92)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
(4 more...)

Technology:

Information Technology > Software (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(3 more...)

The Zeno's Paradox of 'Low-Resource' Languages

Nigatu, Hellina Hailu, Tonja, Atnafu Lambebo, Rosman, Benjamin, Solorio, Thamar, Choudhury, Monojit

arXiv.org Artificial Intelligence, Oct-28-2024

The disparity in the languages commonly studied in Natural Language Processing (NLP) is typically reflected by referring to languages as low- vs. high-resourced. However, there is limited consensus on what exactly qualifies as a 'low-resource language.' To understand how NLP papers define and study 'low resource' languages, we qualitatively analyzed 150 papers from the ACL Anthology and popular speech-processing conferences that mention the keyword 'low-resource.' Based on our analysis, we show how several interacting axes contribute to 'low-resourcedness' of a language and why that makes it difficult to track progress for each individual language. We hope our work (1) elicits explicit definitions of the terminology when it is used in papers and (2) provides grounding for the different axes to consider when connoting a language as low-resource.

artificial intelligence, computational linguistic, natural language, (19 more...)

arXiv.org Artificial Intelligence

2410.20817

Country:

Africa > Kenya (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.06)
North America > Canada > Ontario > Toronto (0.05)
(22 more...)

Genre: Research Report (0.82)

Industry: Education (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

OMoS-QA: A Dataset for Cross-Lingual Extractive Question Answering in a German Migration Context

Kleinle, Steffen, Prange, Jakob, Friedrich, Annemarie

arXiv.org Artificial Intelligence, Jul-22-2024

When immigrating to a new country, it is easy to feel overwhelmed by the need to obtain information on financial support, housing, schooling, language courses, and other issues. If relocation is rushed or even forced, the necessity for high-quality answers to such questions is all the more urgent. Official immigration counselors are usually overbooked, and online systems could guide newcomers to the requested information or a suitable counseling service. To this end, we present OMoS-QA, a dataset of German and English questions paired with relevant trustworthy documents and manually annotated answers, specifically tailored to this scenario. Questions are automatically generated with an open-source large language model (LLM) and answer sentences are selected by crowd workers with high agreement. With our data, we conduct a comparison of 5 pretrained LLMs on the task of extractive question answering (QA) in German and English. Across all models and both languages, we find high precision and low-to-mid recall in selecting answer sentences, which is a favorable trade-off to avoid misleading users. This performance even holds up when the question language does not match the document language. When it comes to identifying unanswerable questions given a context, there are larger differences between the two languages.
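
The precision and recall trade-off mentioned in this abstract can be made concrete with a small sketch. It assumes, purely for illustration, that a document is a list of sentences and that both the model's selection and the gold annotation are sets of sentence indices; the function name and the toy indices are hypothetical and do not come from the dataset or the authors' evaluation code.

```python
def sentence_selection_scores(predicted, gold):
    """Precision and recall of selected answer sentences.

    predicted, gold: iterables of sentence indices (assumed representation).
    Returns (precision, recall); an empty side yields 0.0 for its score.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# A cautious system: it selects one sentence and that sentence is correct,
# but it misses two other gold answer sentences.
precision, recall = sentence_selection_scores(predicted=[2], gold=[2, 5, 7])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every selected sentence is a real answer sentence (high precision) while some answer sentences are missed (lower recall), which is the pattern the abstract describes as favorable when the goal is to avoid misleading users.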

computational linguistic, dataset, proceedings, (11 more...)

arXiv.org Artificial Intelligence

2407.15736

Country:

Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > Dominican Republic (0.04)
(16 more...)

Genre: Research Report (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Government > Regional Government (1.00)
Government > Immigration & Customs (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures

Lankford, Séamus

arXiv.org Artificial Intelligence, Mar-3-2024

In the current machine translation (MT) landscape, the Transformer architecture stands out as the gold standard, especially for high-resource language pairs. This research delves into its efficacy for low-resource language pairs including both the English↔Irish and English↔Marathi language pairs. Notably, the study identifies the optimal hyperparameters and subword model type to significantly improve the translation quality of Transformer models for low-resource language pairs. The scarcity of parallel datasets for low-resource languages can hinder MT development. To address this, gaHealth was developed, the first bilingual corpus of health data for the Irish language. Focusing on the health domain, models developed using this in-domain dataset exhibited very significant improvements in BLEU score when compared with models from the LoResMT2021 Shared Task. A subsequent human evaluation using the multidimensional quality metrics error taxonomy showcased the superior performance of the Transformer system in reducing both accuracy and fluency errors compared to an RNN-based counterpart. Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source applications streamlined for the development, fine-tuning, and deployment of neural machine translation models. These tools considerably simplify the setup and evaluation process, making MT more accessible to both developers and translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes eco-friendly natural language processing research by highlighting the environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM demonstrated advancements in translation performance for two low-resource language pairs: English↔Irish and English↔Marathi, compared to baselines from the LoResMT2021 Shared Task.

evaluation and explainable ai architecture, infrastructure rapid prototype development, language resource and evaluation conference, (14 more...)

arXiv.org Artificial Intelligence

2403.0158

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Portugal > Lisbon > Lisbon (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
(37 more...)

Genre:

Summary/Review (1.00)
Research Report > New Finding (1.00)
Overview (1.00)
(2 more...)

Industry:

Information Technology > Services (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
