portuguese
Exploring Performance Variations in Finetuned Translators of Ultra-Low Resource Languages: Do Linguistic Differences Matter?
Gonçalves, Isabel, Cavalin, Paulo, Pinhanez, Claudio
Finetuning pre-trained language models with small amounts of data is a commonly-used method to create translators for ultra-low resource languages such as endangered Indigenous languages. However, previous works have reported substantially different performances with translators created using similar methodology and data. In this work we systematically explored possible causes of the performance difference, aiming to determine whether it was a product of different cleaning procedures, limitations of the pre-trained models, the size of the base model, or the size of the training dataset, studying both directions of translation. Our studies, using two Brazilian Indigenous languages, related but with significant structural linguistic characteristics, indicated none or very limited influence from those training factors, suggesting differences between languages may play a significant role in the ability to produce translators by fine-tuning pre-trained models.
- South America > Paraguay (0.14)
- North America > Mexico > Mexico City > Mexico City (0.05)
- South America > Brazil > São Paulo (0.04)
- (14 more...)
ToxSyn: Reducing Bias in Hate Speech Detection via Synthetic Minority Data in Brazilian Portuguese
Brito, Iago Alves, Dollis, Julia Soares, Färber, Fernanda Bufon, Silva, Diogo Fernandes Costa, Filho, Arlindo Rodrigues Galvão
The development of robust hate speech detection systems remains limited by the lack of large-scale, fine-grained training data, especially for languages beyond English. Existing corpora typically rely on coarse toxic/non-toxic labels, and the few that capture hate directed at specific minority groups critically lack the non-toxic counterexamples (i.e., benign text about minorities) required to distinguish genuine hate from mere discussion. We introduce ToxSyn, the first Portuguese large-scale corpus explicitly designed for multi-label hate speech detection across nine protected minority groups. Generated via a controllable four-stage pipeline, ToxSyn includes discourse-type annotations to capture rhetorical strategies of toxic language, such as sarcasm or dehumanization. Crucially, it systematically includes the non-toxic counterexamples absent in all other public datasets. Our experiments reveal a catastrophic, mutual generalization failure between social-media domains and ToxSyn: models trained on social media struggle to generalize to minority-specific contexts, and vice-versa. This finding indicates they are distinct tasks and exposes summary metrics like Macro F1 can be unreliable indicators of true model behavior, as they completely mask model failure. We publicly release ToxSyn at HuggingFace to foster reproducible research on synthetic data generation and benchmark progress in hate-speech detection for low- and mid-resource languages.
Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection
Eilertsen, Brage, Bjørgfinsdóttir, Røskva, Vargas, Francielle, Ramezani-Kebrya, Ali
The opaque nature of deep learning models presents significant challenges for the ethical deployment of hate speech detection systems. To address this limitation, we introduce Supervised Rational Attention (SRA), a framework that explicitly aligns model attention with human rationales, improving both interpretability and fairness in hate speech classification. SRA integrates a supervised attention mechanism into transformer-based classifiers, optimizing a joint objective that combines standard classification loss with an alignment loss term that minimizes the discrepancy between attention weights and human-annotated rationales. We evaluated SRA on hate speech benchmarks in English (HateXplain) and Portuguese (HateBRXplain) with rationale annotations. Empirically, SRA achieves 2.4x better explainability compared to current baselines, and produces token-level explanations that are more faithful and human-aligned. In terms of fairness, SRA achieves competitive fairness across all measures, with second-best performance in detecting toxic posts targeting identity groups, while maintaining comparable results on other metrics. These findings demonstrate that incorporating human rationales into attention mechanisms can enhance interpretability and faithfulness without compromising fairness.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- (14 more...)
- Law (0.93)
- Information Technology > Security & Privacy (0.93)
Human-Level Reasoning: A Comparative Study of Large Language Models on Logical and Abstract Reasoning
Evaluating reasoning ability in Large Language Models (LLMs) is important for advancing artificial intelligence, as it transcends mere linguistic task performance. It involves understanding whether these models truly understand information, perform inferences, and are able to draw conclusions in a logical and valid way. This study compare logical and abstract reasoning skills of several LLMs - including GPT, Claude, DeepSeek, Gemini, Grok, Llama, Mistral, Perplexity, and Sabi a - using a set of eight custom-designed reasoning questions. The LLM results are benchmarked against human performance on the same tasks, revealing significant differences and indicating areas where LLMs struggle with deduction.
- North America > United States (0.04)
- South America > Brazil > Santa Catarina (0.04)
MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Multi-hop Hate Speech Explanation
Trager, Jackson, Vargas, Francielle, Alves, Diego, Guida, Matteo, Ngueajio, Mikel K., Agrawal, Ameeta, Daryani, Yalda, Karimi-Malekabadi, Farzan, Plaza-del-Arco, Flor Miriam
Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via multi-hop hate speech explanation using the Moral Foundations Theory. MFTCXplain comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Our results show a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited mainly in underrepresented languages. Our findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning
- North America > United States > California (0.86)
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- (19 more...)
- Law (1.00)
- Government > Regional Government (1.00)
- Media (0.93)
- (2 more...)
Bridging Gaps in Hate Speech Detection: Meta-Collections and Benchmarks for Low-Resource Iberian Languages
Piot, Paloma, Campos, José Ramom Pichel, Parapar, Javier
Hate speech poses a serious threat to social cohesion and individual well-being, particularly on social media, where it spreads rapidly. While research on hate speech detection has progressed, it remains largely focused on English, resulting in limited resources and benchmarks for low-resource languages. Moreover, many of these languages have multiple linguistic varieties, a factor often overlooked in current approaches. At the same time, large language models require substantial amounts of data to perform reliably, a requirement that low-resource languages often cannot meet. In this work, we address these gaps by compiling a meta-collection of hate speech datasets for European Spanish, standardised with unified labels and metadata. This collection is based on a systematic analysis and integration of existing resources, aiming to bridge the data gap and support more consistent and scalable hate speech detection. We extended this collection by translating it into European Portuguese and into a Galician standard that is more convergent with Spanish and another Galician variant that is more convergent with Portuguese, creating aligned multilingual corpora. Using these resources, we establish new benchmarks for hate speech detection in Iberian languages. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, providing baseline results for future research. Moreover, we perform a cross-lingual analysis with our target languages. Our findings underscore the importance of multilingual and variety-aware approaches in hate speech detection and offer a foundation for improved benchmarking in underrepresented European languages.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (15 more...)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.68)
- Information Technology > Services (0.67)
- Law > Civil Rights & Constitutional Law (0.47)
- Government > Regional Government (0.46)
AKCIT-FN at CheckThat! 2025: Switching Fine-Tuned SLMs and LLM Prompting for Multilingual Claim Normalization
Almada, Fabrycio Leite Nakano, Mariano, Kauan Divino Pouso, Dutra, Maykon Adriell, Monteiro, Victor Emanuel da Silva, Gomes, Juliana Resplande Sant'Anna, Filho, Arlindo Rodrigues Galvão, Soares, Anderson da Silva
Claim normalization, the transformation of informal social media posts into concise, self-contained statements, is a crucial step in automated fact-checking pipelines. This paper details our submission to the CLEF-2025 CheckThat! Task~2, which challenges systems to perform claim normalization across twenty languages, divided into thirteen supervised (high-resource) and seven zero-shot (no training data) tracks. Our approach, leveraging fine-tuned Small Language Models (SLMs) for supervised languages and Large Language Model (LLM) prompting for zero-shot scenarios, achieved podium positions (top three) in fifteen of the twenty languages. Notably, this included second-place rankings in eight languages, five of which were among the seven designated zero-shot languages, underscoring the effectiveness of our LLM-based zero-shot strategy. For Portuguese, our initial development language, our system achieved an average METEOR score of 0.5290, ranking third. All implementation artifacts, including inference, training, evaluation scripts, and prompt configurations, are publicly available at https://github.com/ju-resplande/checkthat2025_normalization.
- Europe > Switzerland (0.04)
- Europe > Netherlands (0.04)
- South America > Brazil > Goiás (0.04)
- (11 more...)
- Research Report (0.64)
- Workflow (0.48)
BRoverbs -- Measuring how much LLMs understand Portuguese proverbs
Almeida, Thales Sales, Bonás, Giovana Kerche, Santos, João Guilherme Alves
Large Language Models (LLMs) exhibit significant performance variations depending on the linguistic and cultural context in which they are applied. This disparity signals the necessity of mature evaluation frameworks that can assess their capabilities in specific regional settings. In the case of Portuguese, existing evaluations remain limited, often relying on translated datasets that may not fully capture linguistic nuances or cultural references. Meanwhile, native Portuguese-language datasets predominantly focus on structured national exams or sentiment analysis of social media interactions, leaving gaps in evaluating broader linguistic understanding. To address this limitation, we introduce BRoverbs, a dataset specifically designed to assess LLM performance through Brazilian proverbs. Proverbs serve as a rich linguistic resource, encapsulating cultural wisdom, figurative expressions, and complex syntactic structures that challenge the model comprehension of regional expressions. BRoverbs aims to provide a new evaluation tool for Portuguese-language LLMs, contributing to advancing regionally informed benchmarking. The benchmark is available at https://huggingface.co/datasets/Tropic-AI/BRoverbs.
- South America > Brazil > São Paulo > Campinas (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- North America > Central America (0.04)
- (4 more...)
Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora
Almeida, Thales Sales, Nogueira, Rodrigo, Pedrini, Helio
The performance of large language models (LLMs) is deeply influenced by the quality and composition of their training data. While much of the existing work has centered on English, there remains a gap in understanding how to construct effective training corpora for other languages. We explore scalable methods for building web-based corpora for LLMs. We apply them to build a new 120B token corpus in Portuguese that achieves competitive results to an industrial-grade corpus. Using a continual pretraining setup, we study how different data selection and preprocessing strategies affect LLM performance when transitioning a model originally trained in English to another language. Our findings demonstrate the value of language-specific filtering pipelines, including classifiers for education, science, technology, engineering, and mathematics (STEM), as well as toxic content. We show that adapting a model to the target language leads to performance improvements, reinforcing the importance of high-quality, language-specific data. While our case study focuses on Portuguese, our methods are applicable to other languages, offering insights for multilingual LLM development.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- South America > Brazil > São Paulo > Campinas (0.04)
- Europe > Switzerland (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)
LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization
Chary, Luis Felipe, Ramirez, Miguel Arjona
We present LatinX, a multilingual text-to-speech (TTS) model for cascaded speech-to-speech translation that preserves the source speaker's identity across languages. LatinX is a 12-layer decoder-only Transformer trained in three stages: (i) pre-training for text-to-audio mapping, (ii) supervised fine-tuning for zero-shot voice cloning, and (iii) alignment with Direct Preference Optimization (DPO) using automatically labeled pairs based on Word Error Rate (WER) and speaker-similarity metrics. Trained on English and Romance languages with emphasis on Portuguese, LatinX with DPO consistently reduces WER and improves objective similarity over the fine-tuned baseline. Human evaluations further indicate stronger perceived speaker similarity than a strong baseline (XTTSv2), revealing gaps between objective and subjective measures. We provide cross-lingual analyses and discuss balanced preference signals and lower-latency architectures as future work.