Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family

Lin, Yumeng, Duan, Xufeng, Haslett, David, Chen, Yige, Cai, Zhenguang G.

arXiv.org Artificial Intelligence 

Brain and Mind Institute, The Chinese University of Hong Kong, Hong Kong, Hong Kong SAR, China. Correspondence should be addressed to Zhenguang G. Cai, Department of Linguistics and Modern Languages, Leung Kau Kui Building, The Chinese University of Hong Kong, Shatin, Hong Kong SAR; zhenguangcai@cuhk.edu.hk.

Abstract: Large language models have achieved impressive progress in multilingual translation, yet they continue to face challenges with certain language pairs, particularly those with limited training data or significant linguistic divergence from English. This study systematically investigates how training data, language proximity, and language family affect information loss in multilingual translation. We evaluate two large language models, GPT-4 and Llama 2, by performing round-trip translations. Translation quality was assessed using BLEU scores and BERT similarity metrics. Our results reveal a robust interaction between training data size and language distance: while abundant training data can mitigate the effects of linguistic divergence, languages structurally closer to English consistently yield higher translation quality in low-resource conditions. Among various distance metrics, orthographic, phylogenetic, syntactic, and geographical distances emerge as strong predictors of translation performance. Language family also exerts an independent influence. These findings contribute to a deeper understanding of the linguistic constraints shaping multilingual translation in large language models, emphasizing that translation quality is shaped not only by data volume but also by structural and typological relationships between languages.

1 INTRODUCTION

Large Language Models (LLMs) demonstrated advanced multilingual capabilities.
