translation performance
Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains
Sayeedi, Md. Faiyaz Abdullah, Alam, Md. Mahbub, Rahman, Subhey Sadi, Islam, Md. Adnanul, Deepti, Jannatul Ferdous, Mohiuddin, Tasnim, Islam, Md Mofijul, Shatabda, Swakkhar
The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify different biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using different metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: https://github.com/faiyazabdullah/TranslationTangles
Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish
Li, Lujun, Song, Yewei, Sleem, Lama, Wang, Yiqun, Xu, Yangjie, Lothritz, Cedric, Gentile, Niccolo, State, Radu, Bissyande, Tegawende F., Klein, Jacques
Grammar refers to the system of rules that governs the structural organization and the semantic relations among linguistic units such as sentences, phrases, and words within a given language. In natural language processing, there remains a notable scarcity of grammar focused evaluation protocols, a gap that is even more pronounced for low-resource languages. Moreover, the extent to which large language models genuinely comprehend grammatical structure, especially the mapping between syntactic structures and meanings, remains under debate. To investigate this issue, we propose a Grammar Book Guided evaluation pipeline intended to provide a systematic and generalizable framework for grammar evaluation consisting of four key stages, and in this work we take Luxembourgish as a case study. The results show a weak positive correlation between translation performance and grammatical understanding, indicating that strong translations do not necessarily imply deep grammatical competence. Larger models perform well overall due to their semantic strength but remain weak in morphology and syntax, struggling particularly with Minimal Pair tasks, while strong reasoning ability offers a promising way to enhance their grammatical understanding.
SemiAdapt and SemiLoRA: Efficient Domain Adaptation for Transformer-based Low-Resource Language Translation with a Case Study on Irish
McGiff, Josh, Nikolov, Nikola S.
Fine-tuning is widely used to tailor large language models for specific tasks such as neural machine translation (NMT). However, leveraging transfer learning is computationally expensive when fine-tuning large multilingual models with billions of parameters, thus creating a barrier to entry for researchers working on low-resource domains such as Irish translation. Parameter-efficient fine-tuning (PEFT) bridges this gap by training on a fraction of the original model parameters, with the Low-Rank Adaptation (LoRA) approach introducing small, trainable adapter layers. We introduce SemiAdapt and SemiLoRA as semi-supervised inference-efficient approaches that strengthen domain adaptation and lead to improved overall performance in NMT. We demonstrate that SemiAdapt can outperform full-domain fine-tuning, while most notably, SemiLoRA can propel PEFT methods to match or even outperform full-model fine-tuning. We further evaluate domain-by-dataset fine-tuning and demonstrate that our embedding-based inference methods perform especially well on larger and noisier corpora. All Irish translation models developed in this work are released as open resources. These methods aim to make high-quality domain adaptation and fine-tuning more accessible to researchers working with low-resource languages.
LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning
Gao, Changjiang, Huang, Zixian, Gong, Jingyang, Huang, Shujian, Li, Lei, Yuan, Fei
General Large Language Models (LLMs) excel in reasoning, but those enhanced for translation struggle with reasoning tasks. To address this, we propose a novel translationenhanced recipe that begins with instruct models and applies layer-selective tuning only on parallel data. Following this pipeline, we introduce the Qwen3-XPlus models, which demonstrate significant improvements in translation performance across both high- and lowresource languages, achieving 15+ spBLEU and 40+ xComet in low-resource languages, like Swahili. Interestingly, training only with small parallel datasets, Qwen3-XPlus achieves an average improvement of 1+ points on 7 multilingual tasks while maintaining proficiency comparable to the Qwen3 instruct model in 15 popular reasoning datasets. This work offers a promising approach to multilingual enhancement, significantly reducing complexity and enhancing accessibility for a wider range of languages. The code and model are publicly available.
Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models
Akera, Benjamin, Ouma, Evelyn Nafula, Yiga, Gilbert, Walukagga, Patrick, Natukunda, Phionah, Saaka, Trevor, Nsumba, Solomon, Nabukeera, Lilian Teddy, Muhanguzi, Joel, Sekalala, Imran, Namara, Nimpamya Janat, Bainomugisha, Engineer, Mwebaze, Ernest, Quinn, John
There are more than 2000 living languages in Africa, most of which have been bypassed by advances in language technology. Current leading LLMs exhibit strong performance on a number of the most common languages (e.g. Swahili or Yoruba), but prioritise support for the languages with the most speakers first, resulting in piecemeal ability across disparate languages. We contend that a regionally focussed approach is more efficient, and present a case study for Uganda, a country with high linguistic diversity. We describe the development of Sunflower 14B and 32B, a pair of models based on Qwen 3 with state of the art comprehension in the majority of all Ugandan languages. These models are open source and can be used to reduce language barriers in a number of important practical applications.
Exploring NLP Benchmarks in an Extremely Low-Resource Setting
The effectiveness of Large Language Models (LLMs) diminishes for extremely low-resource languages, such as indigenous languages, primarily due to the lack of labeled data. Despite growing interest, the availability of high-quality natural language processing (NLP) datasets for these languages remains limited, making it difficult to develop robust language technologies. This paper addresses such gap by focusing on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. Leveraging a small set of parallel Ladin-Italian sentence pairs, we create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data. To ensure linguistic quality and reliability, we apply rigorous filtering and back-translation procedures in our method. We further demonstrate that incorporating these synthetic datasets into machine translation training leads to substantial improvements over existing Italian-Ladin translation baselines. Our contributions include the first publicly available sentiment analysis and MCQA datasets for Ladin, establishing foundational resources that can support broader NLP research and downstream applications for this underrepresented language.
Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model
Hennara, Khalil, Hreden, Muhammad, Hamed, Mohamed Motaism, Aldallal, Zeina, Chrouf, Sara, AlModhayan, Safwan
We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller models. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B , a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus.. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.
LLaVA-NeuMT: Selective Layer-Neuron Modulation for Efficient Multilingual Multimodal Translation
Wei, Jingxuan, Jia, Caijun, Chen, Qi, Cai, Yujun, Sun, Linzhuang, Zhang, Xiangxiang, Wu, Gaowei, Yu, Bihui
Multimodal Machine Translation (MMT) enhances translation quality by incorporating visual context, helping to resolve textual ambiguities. While existing MMT methods perform well in bilingual settings, extending them to multilingual translation remains challenging due to cross-lingual interference and ineffective parameter-sharing strategies. To address this, we propose LLaVA-NeuMT, a novel multimodal multilingual translation framework that explicitly models language-specific and language-agnostic representations to mitigate multilingual interference. Our approach consists of a layer selection mechanism that identifies the most informative layers for different language pairs and a neuron-level adaptation strategy that dynamically selects language-specific and agnostic neurons to improve translation quality while reducing redundancy. We conduct extensive experiments on the M3-Multi30K and M3-AmbigCaps datasets, demonstrating that LLaVA-NeuMT, while fine-tuning only 40\% of the model parameters, surpasses full fine-tuning approaches and ultimately achieves SOTA results on both datasets. Our analysis further provides insights into the importance of selected layers and neurons in multimodal multilingual adaptation, offering an efficient and scalable solution to cross-lingual adaptation in multimodal translation.
SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models
Lu, Hongyuan, Li, Zixuan, Zhang, Zefan, Lam, Wai
There are more than 7,000 languages around the world, and current Large Language Models (LLMs) only support hundreds of languages. Dictionary-based prompting methods can enhance translation on them, but most methods use all the available dictionaries, which could be expensive. Instead, it will be flexible to have a trade-off between token consumption and translation performance. This paper proposes a novel task called \textbf{A}utomatic \textbf{D}ictionary \textbf{S}election (\textbf{ADS}). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method which we call \textbf{S}elect \textbf{Lo}w-frequency \textbf{W}ords! (\textbf{SLoW}) which selects those dictionaries that have a lower frequency. Our methods have unique advantages. First, there is no need for access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning is required on LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines, and it can obviously save token usage, with many languages even surpassing the translation performance of the full dictionary baseline.\footnote{A shocking fact is that there is no need to use the actual training data (often unobtainable) for frequency estimation, and an estimation frequency obtained using public resources is still apparently effective in improving translation with ChatGPT and Llama, and DeepSeek.}\footnote{Code and data available upon publication.}
Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family
Lin, Yumeng, Duan, Xufeng, Haslett, David, Chen, Yige, Cai, Zhenguang G.
Brain and Mind Institute, The Chinese University of Hong Kong, Hong Kong, Hong Kong SAR, China. Correspondence should be addressed to Zhenguang G. Cai, Department of Linguistics and Modern Languages, Leung Kau Kui Building, The Chinese University of Hong Kong, Shatin, Hong Kong SAR; zhenguangcai@cuhk.edu.hk. Abstract: Large l anguage m odels have achieved impressive progress in multilingual translation, yet they continue to face challenges with certain language pairs --particularly those with limited training data or significant linguistic divergence from English. This study systematically investigates how training data, language proximity, and language family affect information loss in multilingual translation . We evaluate two large language model s, GPT - 4 and Llama 2, by performing round-trip translation s . Translation quality was assessed using BLEU scores and BERT similarity metrics. Our results reveal a robust interaction between training data size and language distance: while abundant training data can mitigate the effects of linguistic divergence, languages structurally closer to English consistently yield higher translation quality in low - resource conditions. Among various distance metrics, orthographic, phylogenetic, syntactic, and geographical distances emerge as strong predictors of translation performance. L anguage family also exert s an independent influence. These findings contribute to a deeper understanding of the linguistic constraints shaping multilingual translation in large language models, emphasizing that translation quality is shaped not only by data volume but also by structural and typological relationships between languages. 1 INTRODUCTION Large Language Models (LLMs) demonstrated advanced multilingual capabilities.