Bridging Gaps in Hate Speech Detection: Meta-Collections and Benchmarks for Low-Resource Iberian Languages
Piot, Paloma, Campos, José Ramom Pichel, Parapar, Javier
Hate speech poses a serious threat to social cohesion and individual well-being, particularly on social media, where it spreads rapidly. While research on hate speech detection has progressed, it remains largely focused on English, resulting in limited resources and benchmarks for low-resource languages. Moreover, many of these languages have multiple linguistic varieties, a factor often overlooked in current approaches. At the same time, large language models require substantial amounts of data to perform reliably, a requirement that low-resource languages often cannot meet. In this work, we address these gaps by compiling a meta-collection of hate speech datasets for European Spanish, standardised with unified labels and metadata. This collection is based on a systematic analysis and integration of existing resources, aiming to bridge the data gap and support more consistent and scalable hate speech detection. We extend this collection by translating it into European Portuguese and into two Galician varieties: a standard more convergent with Spanish and a variant more convergent with Portuguese, creating aligned multilingual corpora. Using these resources, we establish new benchmarks for hate speech detection in Iberian languages. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, providing baseline results for future research. Moreover, we perform a cross-lingual analysis with our target languages. Our findings underscore the importance of multilingual and variety-aware approaches in hate speech detection and offer a foundation for improved benchmarking in underrepresented European languages.
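The label-and-metadata unification step described above can be sketched as a simple mapping from each source dataset's annotation scheme onto a shared schema. The dataset names and label values below are purely illustrative assumptions, not the paper's actual sources or schema:

```python
# Hypothetical sketch: map heterogeneous hate-speech annotations onto a
# unified binary label plus provenance metadata. Dataset names and label
# vocabularies here are invented for illustration.

UNIFIED_MAP = {
    "dataset_a": {"hateful": "hate", "normal": "not-hate"},
    "dataset_b": {"1": "hate", "0": "not-hate"},
}

def unify(example: dict, source: str) -> dict:
    """Convert one source example to the unified schema."""
    mapping = UNIFIED_MAP[source]
    return {
        "text": example["text"],
        "label": mapping[str(example["label"])],
        "source": source,      # keep provenance for later analysis
        "language": "es",      # the meta-collection is European Spanish
    }

row = unify({"text": "example text", "label": 1}, "dataset_b")
print(row["label"])  # hate
```

Keeping the source field on every row makes it possible to run per-dataset error analyses after the collections are merged.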
Evaluating Compact LLMs for Zero-Shot Iberian Language Tasks on End-User Devices
Seller, Luís Couto, Torres, Íñigo Sanz, Vogel-Fernández, Adrián, Carballo, Carlos González, Sánchez, Pedro Miguel Sánchez, Martín, Adrián Carruana, Ambite, Enrique de Miguel
Large Language Models have significantly advanced natural language processing, achieving remarkable performance in tasks such as language generation, translation, and reasoning. However, their substantial computational requirements restrict deployment to high-end systems, limiting accessibility on consumer-grade devices. This challenge is especially pronounced for under-resourced languages like those spoken in the Iberian Peninsula, where relatively limited linguistic resources and benchmarks hinder effective evaluation. This work presents a comprehensive evaluation of compact state-of-the-art LLMs across several essential NLP tasks tailored for Iberian languages. The results reveal that while some models consistently excel in certain tasks, significant performance gaps remain, particularly for languages such as Basque. These findings highlight the need for further research on balancing model compactness with robust multilingual performance.
Evaluating Large Language Models on Multiword Expressions in Multilingual and Code-Switched Contexts
De Leon, Frances Laureano, Madabushi, Harish Tayyar, Lee, Mark G.
Multiword expressions, characterised by non-compositional meanings and syntactic irregularities, are an example of nuanced language. These expressions can be used literally or idiomatically, leading to significant changes in meaning. While large language models have demonstrated strong performance across many tasks, their ability to handle such linguistic subtleties remains uncertain. Therefore, this study evaluates how state-of-the-art language models process the ambiguity of potentially idiomatic multiword expressions, particularly in contexts that are less frequent, where models are less likely to rely on memorisation. By evaluating models in Portuguese and Galician, in addition to English, and using a novel code-switched dataset and a novel task, we find that large language models, despite their strengths, struggle with nuanced language. In particular, we find that the latest models, including GPT-4, fail to outperform the xlm-roBERTa-base baselines in both detection and semantic tasks, with especially poor performance on the novel task we introduce, despite its similarity to existing tasks. Overall, our results demonstrate that multiword expressions, especially those which are ambiguous, continue to be a challenge to models.
Truth Knows No Language: Evaluating Truthfulness Beyond English
Figueras, Blanca Calvo, Sagarzazu, Eneko, Etxaniz, Julen, Barnes, Jeremy, Gamallo, Pablo, Flores, Iria De Dios, Agerri, Rodrigo
We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations of large language models (LLMs) have primarily been conducted in English. However, the ability of LLMs to maintain truthfulness across languages remains under-explored. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our findings reveal that, while LLMs perform best in English and worst in Basque (the lowest-resourced language), overall truthfulness discrepancies across languages are smaller than anticipated. Furthermore, we show that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics, and that informativeness plays a critical role in truthfulness assessment. Our results also indicate that machine translation provides a viable approach for extending truthfulness benchmarks to additional languages, offering a scalable alternative to professional translation. Finally, we observe that universal knowledge questions are better handled across languages than context- and time-dependent ones, highlighting the need for truthfulness evaluations that account for cultural and temporal variability. Dataset and code are publicly available under open licenses.
Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition
Piñeiro-Martín, Andrés, García-Mateo, Carmen, Docío-Fernández, Laura, López-Pérez, María del Carmen, Rehm, Georg
This paper addresses the challenge of integrating low-resource languages into multilingual automatic speech recognition (ASR) systems. We introduce a novel application of weighted cross-entropy, typically used for unbalanced datasets, to facilitate the integration of low-resource languages into pre-trained multilingual ASR models within the context of continual multilingual learning. We fine-tune the Whisper multilingual ASR model on five high-resource languages and one low-resource language, employing language-weighted dynamic cross-entropy and data augmentation. The results show a remarkable 6.69% word error rate (WER) reduction for the low-resource language compared to the fine-tuned model without applying our approach, and a 48.86% WER reduction compared to the original Whisper model. In addition, our approach yields an average WER reduction of 3.29% across the six languages, showing no degradation for the high-resource languages.
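The core idea above, scaling each example's cross-entropy loss by a per-language weight so the low-resource language contributes more to the gradient, can be illustrated in a few lines. The weight values below are made up for illustration and are not the paper's actual weighting scheme:

```python
# Minimal sketch of language-weighted cross-entropy, assuming a
# per-language weight table (values here are invented, not the paper's).
import math

def weighted_cross_entropy(probs: dict, target: str, weight: float) -> float:
    """Cross-entropy of one prediction, scaled by a per-language weight."""
    return -weight * math.log(probs[target])

# Up-weight the low-resource language so its gradients are not drowned
# out by the high-resource languages during continual fine-tuning.
lang_weights = {"es": 1.0, "gl": 3.0}  # illustrative values only

probs = {"token_a": 0.7, "token_b": 0.3}
loss_high = weighted_cross_entropy(probs, "token_a", lang_weights["es"])
loss_low = weighted_cross_entropy(probs, "token_a", lang_weights["gl"])
print(loss_low > loss_high)  # True
```

In a real training loop the same effect is usually obtained by passing a class-weight vector to the loss function (e.g. the `weight` argument of PyTorch's `CrossEntropyLoss`), with the paper's dynamic variant adjusting those weights during training.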
Open Generative Large Language Models for Galician
Gamallo, Pablo, Rodríguez, Pablo, de-Dios-Flores, Iria, Sotelo, Susana, Paniagua, Silvia, Bardanca, Daniel, Pichel, José Ramom, Garcia, Marcos
Large language models (LLMs) have transformed natural language processing. Yet, their predominantly English-centric training has led to biases and performance disparities across languages. This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for languages with lower resources, such as Galician. We present the first two generative LLMs focused on Galician to bridge this gap. These models, freely available as open-source resources, were trained using a GPT architecture with 1.3B parameters on a corpus of 2.1B words. Leveraging continual pretraining, we adapt to Galician two existing LLMs trained on larger corpora, thus mitigating the data constraints that would arise if the training were performed from scratch. The models were evaluated using human judgments and task-based datasets from standardized benchmarks. These evaluations reveal a promising performance, underscoring the importance of linguistic diversity in generative models.
Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens
San, Nay, Paraskevopoulos, Georgios, Arora, Aryaman, He, Xiluo, Kaur, Prabhjot, Adams, Oliver, Jurafsky, Dan
While massively multilingual speech models like wav2vec 2.0 XLSR-128 can be directly fine-tuned for automatic speech recognition (ASR), downstream performance can still be relatively poor on languages that are under-represented in the pre-training data. Continued pre-training on 70-200 hours of untranscribed speech in these languages can help -- but what about languages without that much recorded data? For such cases, we show that supplementing the target language with data from a similar, higher-resource 'donor' language can help. For example, continued pre-training on only 10 hours of low-resource Punjabi supplemented with 60 hours of donor Hindi is almost as good as continued pretraining on 70 hours of Punjabi. By contrast, sourcing data from less similar donors like Bengali does not improve ASR performance. To inform donor language selection, we propose a novel similarity metric based on the sequence distribution of induced acoustic units: the Acoustic Token Distribution Similarity (ATDS). Across a set of typologically different target languages (Punjabi, Galician, Iban, Setswana), we show that the ATDS between the target language and its candidate donors precisely predicts target language ASR performance.
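The intuition behind a metric like ATDS, that a good donor language produces acoustic-unit distributions resembling the target's, can be sketched with a simple distribution comparison. Note this is an illustrative proxy only: the cosine similarity over unigram unit frequencies below is an assumption for demonstration, whereas the paper's ATDS is defined over the *sequence* distribution of induced acoustic units:

```python
# Illustrative sketch: compare discrete acoustic-unit distributions of a
# target language against candidate donors. Unit IDs and the cosine
# similarity over unigram frequencies are stand-ins, not the paper's ATDS.
from collections import Counter
import math

def unit_distribution(units: list) -> dict:
    """Normalised frequency distribution of discrete acoustic units."""
    counts = Counter(units)
    total = sum(counts.values())
    return {u: c / total for u, c in counts.items()}

def distribution_similarity(p: dict, q: dict) -> float:
    """Cosine similarity between two unit distributions."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

# Toy unit sequences: the "similar donor" shares most units with the
# target; the "dissimilar donor" uses a disjoint inventory.
target = unit_distribution([1, 2, 2, 3, 3, 3])
similar_donor = unit_distribution([1, 2, 2, 3, 3, 4])
dissimilar_donor = unit_distribution([5, 6, 6, 7, 7, 7])

print(distribution_similarity(target, similar_donor)
      > distribution_similarity(target, dissimilar_donor))  # True
```

Under this sketch, the donor with the higher similarity score would be the one selected to supplement the target language's pre-training data.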
Conversations in Galician: a Large Language Model for an Underrepresented Language
Bao, Eliseo, Pérez, Anxo, Parapar, Javier
The recent proliferation of Large Conversation Language Models has highlighted the economic significance of widespread access to this type of AI technologies in the current information age. Nevertheless, prevailing models have primarily been trained on corpora consisting of documents written in popular languages. The dearth of such cutting-edge tools for low-resource languages further exacerbates their underrepresentation in the current economic landscape, thereby impacting their native speakers. This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language. We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations. This dataset proves invaluable for enhancing language models by fine-tuning them to more accurately adhere to provided instructions. Additionally, as a demonstration of the dataset utility, we fine-tuned LLaMA-7B to comprehend and respond in Galician, a language not originally supported by the model, by following the Alpaca format. This work contributes to the research on multilingual models tailored for low-resource settings, a crucial endeavor in ensuring the inclusion of all linguistic communities in the development of Large Language Models. Another noteworthy aspect of this research is the exploration of how knowledge of a closely related language, in this case, Portuguese, can assist in generating coherent text when training resources are scarce. Both the Galician Alpaca dataset and Cabuxa-7B are publicly accessible on our Huggingface Hub, and we have made the source code available to facilitate replication of this experiment and encourage further advancements for underrepresented languages.
Unsupervised Paraphrasing of Multiword Expressions
Wada, Takashi, Matsumoto, Yuji, Baldwin, Timothy, Lau, Jey Han
We propose an unsupervised approach to paraphrasing multiword expressions (MWEs) in context. Our model employs only monolingual corpus data and pre-trained language models (without fine-tuning), and does not make use of any external resources such as dictionaries. We evaluate our method on the SemEval 2022 idiomatic semantic text similarity task, and show that it outperforms all unsupervised systems and rivals supervised systems.
AI Song Contest: The Eurovision for machine composers
Amid flamboyant performances, flashy costumes and pyrotechnics, the Eurovision Song Contest is probably one of the weirdest, most peculiar shows ever to grace our screens. It's the kind of event that an alien species or an emotionless android would have a hard time understanding, were they watching Europe tune into the annual celebration of kitsch, patriotism and unity. But could machines understand - and reproduce - such a uniquely human experience? The answer could be found at the AI Song Contest, a Eurovision-inspired music competition where all the songs are written by artificial intelligence. Since its creation in 2020, the song contest has been hosted every year by the Belgian city of Liège, where teams of data scientists, programmers, and musicians from all over the world participate with the compositions they have created with the help of AI.