Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?

Tannon Kew, Florian Schottmann, Rico Sennrich

arXiv.org Artificial Intelligence 

The vast majority of today's large language models are English-centric, having been pretrained predominantly on English text. Yet, in order to meet user expectations, models need to be able to respond appropriately in multiple languages once deployed in downstream applications. Given limited exposure to other languages during pretraining, cross-lingual transfer is important for achieving decent performance in non-English settings. In this work, we investigate just how much multilinguality is required during finetuning to elicit strong cross-lingual generalisation across a range of tasks and target languages. We find that, compared to English-only finetuning, multilingual instruction tuning with as few as three languages substantially improves cross-lingual generalisation.

Figure 1: Input/output (IO) language agreement for English (en), German (de), Bulgarian (bg) and Icelandic (is) when instruction tuning on monolingual English (Mono) or on multilingual data (Multi-Guanaco).
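
As a concrete illustration of the IO language agreement metric referenced in Figure 1, the sketch below scores instruction/response pairs by checking whether an off-the-shelf language identifier assigns the response the same language as the prompt. This is a minimal sketch under assumptions: the use of langdetect, the function name io_language_agreement, and the example texts are illustrative only; the paper's actual language-identification tooling and evaluation pipeline may differ, and langdetect covers a limited set of languages, so a broader-coverage identifier (e.g. a fastText LID model) may be preferable for languages such as Icelandic.

```python
# Hypothetical sketch of an IO language agreement score (not the paper's exact pipeline).
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect deterministic across runs


def io_language_agreement(prompts, responses):
    """Fraction of examples where the detected response language
    matches the detected prompt language."""
    assert len(prompts) == len(responses) and prompts
    matches = 0
    for prompt, response in zip(prompts, responses):
        try:
            if detect(prompt) == detect(response):
                matches += 1
        except Exception:
            # Very short or ambiguous texts can fail detection; count as a mismatch.
            pass
    return matches / len(prompts)


# Example: a German prompt answered in English counts as a mismatch (agreement = 0.0).
prompts = ["Wie funktioniert ein Transformer-Modell?"]
responses = ["A transformer model relies on self-attention over the input tokens ..."]
print(io_language_agreement(prompts, responses))
```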