transfer language
Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+
Ng, York Hay, Khan, Aditya, Lu, Xiang, Salloum, Matteo, Zhou, Michael, Hoang, Phuong H., Doğruöz, A. Seza, Lee, En-Shiun Annie
Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. One, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data, and two, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology. We unify these signals into a robust, task-agnostic composite distance. In selecting transfer languages, our representations and composite distances consistently improve performance across a wide range of NLP tasks, providing a more principled and effective toolkit for multilingual research.
Family Matters: Language Transfer and Merging for Adapting Small LLMs to Faroese
Kunz, Jenny, Debess, Iben Nyholm, Simonsen, Annika
We investigate how to adapt small, efficient LLMs to Faroese, a low-resource North Germanic language. Starting from English models, we continue pre-training on related Scandinavian languages, either individually or combined via merging, before fine-tuning on Faroese. We compare full fine-tuning with parameter-efficient tuning using LoRA, evaluating their impact on both linguistic accuracy and text comprehension. Due to the lack of existing Faroese evaluation data, we construct two new minimal-pair benchmarks from adapted and newly collected datasets and complement them with human evaluations by Faroese linguists. Our results demonstrate that transfer from related languages is crucial, though the optimal source language depends on the task: Icelandic enhances linguistic accuracy, whereas Danish boosts comprehension. Similarly, the choice between full fine-tuning and LoRA is task-dependent: LoRA improves linguistic acceptability and slightly increases human evaluation scores on the base model, while full fine-tuning yields stronger comprehension performance and better preserves model capabilities during downstream fine-tuning.
mRAKL: Multilingual Retrieval-Augmented Knowledge Graph Construction for Low-Resourced Languages
Nigatu, Hellina Hailu, Li, Min, ter Hoeve, Maartje, Potdar, Saloni, Chasins, Sarah
Knowledge Graphs represent real-world entities and the relationships between them. Multilingual Knowledge Graph Construction (mKGC) refers to the task of automatically constructing or predicting missing entities and links for knowledge graphs in a multilingual setting. In this work, we reformulate the mKGC task as a Question Answering (QA) task and introduce mRAKL: a Retrieval-Augmented Generation (RAG) based system to perform mKGC. We achieve this by using the head entity and linking relation in a question, and having our model predict the tail entity as an answer. Our experiments focus primarily on two low-resourced languages: Tigrinya and Amharic. We experiment with using higher-resourced languages Arabic and English for cross-lingual transfer. With a BM25 retriever, we find that the RAG-based approach improves performance over a no-context setting. Further, our ablation studies show that with an idealized retrieval system, mRAKL improves accuracy by 4.92 and 8.79 percentage points for Tigrinya and Amharic, respectively.
Untangling the Influence of Typology, Data and Model Architecture on Ranking Transfer Languages for Cross-Lingual POS Tagging
Rice, Enora, Marashian, Ali, Haynie, Hannah, von der Wense, Katharina, Palmer, Alexis
Cross-lingual transfer learning is an invaluable tool for overcoming data scarcity, yet selecting a suitable transfer language remains a challenge. The precise roles of linguistic typology, training data, and model architecture in transfer language choice are not fully understood. We take a holistic approach, examining how both dataset-specific and fine-grained typological features influence transfer language selection for part-of-speech tagging, considering two different sources for morphosyntactic features. While previous work examines these dynamics in the context of bilingual biLSTMS, we extend our analysis to a more modern transfer learning pipeline: zero-shot prediction with pretrained multilingual models. We train a series of transfer language ranking systems and examine how different feature inputs influence ranker performance across architectures. Word overlap, type-token ratio, and genealogical distance emerge as top features across all architectures. Our findings reveal that a combination of typological and dataset-dependent features leads to the best rankings, and that good performance can be obtained with either feature group on its own.
An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models
Faisal, Fahim, Anastasopoulos, Antonios
The capacity and effectiveness of pre-trained multilingual models (MLMs) for zero-shot cross-lingual transfer is well established. However, phenomena of positive or negative transfer, and the effect of language choice still need to be fully understood, especially in the complex setting of massively multilingual LMs. We propose an efficient method to study transfer language influence in zero-shot performance on another target language. Unlike previous work, our approach disentangles downstream tasks from language, using dedicated adapter units. Our findings suggest that some languages do not largely affect others, while some languages, especially ones unseen during pre-training, can be extremely beneficial or detrimental for different target languages. We find that no transfer language is beneficial for all target languages. We do, curiously, observe languages previously unseen by MLMs consistently benefit from Figure 1: Our approach uses efficient few-step continued transfer from almost any language. We additionally tuning (left) and adapter modules (right) to disentangle use our modular approach to quantify the effect of task and language to quantify the effect negative interference efficiently and catagorize of a transfer language for a given task and model.
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition
Adelani, David Ifeoluwa, Neubig, Graham, Ruder, Sebastian, Rijhwani, Shruti, Beukman, Michael, Palen-Michel, Chester, Lignos, Constantine, Alabi, Jesujoba O., Muhammad, Shamsuddeen H., Nabende, Peter, Dione, Cheikh M. Bamba, Bukula, Andiswa, Mabuya, Rooweither, Dossou, Bonaventure F. P., Sibanda, Blessing, Buzaaba, Happy, Mukiibi, Jonathan, Kalipe, Godson, Mbaye, Derguene, Taylor, Amelia, Kabore, Fatoumata, Emezue, Chris Chinenye, Aremu, Anuoluwapo, Ogayo, Perez, Gitau, Catherine, Munkoh-Buabeng, Edwin, Koagne, Victoire M., Tapo, Allahsera Auguste, Macucwa, Tebogo, Marivate, Vukosi, Mboning, Elvis, Gwadabe, Tajuddeen, Adewumi, Tosin, Ahia, Orevaoghene, Nakatumba-Nabende, Joyce, Mokono, Neo L., Ezeani, Ignatius, Chukwuneke, Chiamaka, Adeyemi, Mofetoluwa, Hacheme, Gilles Q., Abdulmumin, Idris, Ogundepo, Odunayo, Yousuf, Oreen, Ngoli, Tatiana Moteu, Klakow, Dietrich
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically-diverse African languages.
Zero-Shot Language Transfer vs Iterative Back Translation for Unsupervised Machine Translation
Joshi, Aviral, Huang, Chengzhi, Singh, Har Simrat
This work focuses on comparing different solutions for machine translation on low resource language pairs, namely, with zero-shot transfer learning and unsupervised machine translation. We discuss how the data size affects the performance of both unsupervised MT and transfer learning. Additionally we also look at how the domain of the data affects the result of unsupervised MT. The code to all the experiments performed in this project are accessible on Github.