Eriguchi, Akiko
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
Han, HyoJung, Eriguchi, Akiko, Xu, Haoran, Hoang, Hieu, Carpuat, Marine, Khayrallah, Huda
Vocabulary adaptation, which integrates new vocabulary into pre-trained language models (LMs), enables expansion to new languages and mitigates token over-fragmentation. However, existing approaches are limited by their reliance on heuristic or external embeddings. We propose VocADT, a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings while keeping the model's weights fixed. VocADT offers a flexible and scalable solution without requiring external resources or language constraints. Across 11 languages--with various scripts, resource availability, and degrees of fragmentation--we demonstrate that VocADT outperforms the original Mistral model (Jiang et al., 2023) and other baselines across various multilingual tasks. We find that Latin-script languages and highly fragmented languages benefit the most from vocabulary adaptation. We further fine-tune the adapted model on the generative task of machine translation and find that vocabulary adaptation is still beneficial after fine-tuning and that VocADT is the most effective method.
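To make the mechanism concrete, here is a minimal PyTorch sketch of a vocabulary adapter in this spirit: each new-token embedding is a trainable linear combination of the frozen original embedding table. The class and variable names (VocabAdapter, mix, old_table) and the toy sizes are assumptions for illustration, not the released VocADT implementation.

```python
import torch
import torch.nn as nn

class VocabAdapter(nn.Module):
    """New-vocabulary embeddings as a learned linear mix of frozen old embeddings.

    Only the mixing matrix is trained; the pre-trained embedding table and the
    rest of the model stay fixed. (Illustrative sketch, not the VocADT release.)
    """

    def __init__(self, old_embedding: nn.Embedding, new_vocab_size: int):
        super().__init__()
        # Frozen copy of the original embedding table, shape (V_old, d).
        self.register_buffer("old_table", old_embedding.weight.detach().clone())
        # Trainable combination weights: one row per new-vocabulary token.
        self.mix = nn.Parameter(torch.empty(new_vocab_size, self.old_table.size(0)))
        nn.init.normal_(self.mix, std=0.02)

    def forward(self, new_token_ids: torch.Tensor) -> torch.Tensor:
        new_table = self.mix @ self.old_table      # (V_new, d)
        return new_table[new_token_ids]            # (..., d)

# Example with toy sizes: adapt a 1K-token table to a hypothetical 1.2K-token vocabulary.
adapter = VocabAdapter(nn.Embedding(1000, 64), new_vocab_size=1200)
vecs = adapter(torch.tensor([[0, 5, 17]]))         # (1, 3, 64)
```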
X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale
Xu, Haoran, Murray, Kenton, Koehn, Philipp, Hoang, Hieu, Eriguchi, Akiko, Khayrallah, Huda
Large language models (LLMs) have achieved remarkable success across various NLP tasks, yet their focus has predominantly been on English due to English-centric pre-training and limited multilingual data. While some multilingual LLMs claim to support hundreds of languages, these models often fail to provide high-quality responses for mid- and low-resource languages, leading to imbalanced performance heavily skewed in favor of high-resource languages like English and Chinese. We prioritize quality over scaling the number of languages, with a focus on the multilingual machine translation task, and introduce X-ALMA, a model designed to ensure top-tier performance across 50 diverse languages, regardless of their resource levels. This is achieved by a plug-and-play language-specific module architecture that prevents language conflicts during training and a carefully designed training regimen with novel optimization methods that maximize translation performance. At the final stage of the training regimen, our proposed Adaptive-Rejection Preference Optimization (ARPO) surpasses existing preference optimization methods in translation tasks. Large language models (LLMs) such as the GPT series (Brown et al., 2020; OpenAI, 2023), Mistral (Jiang et al., 2023), the LLaMA series (Touvron et al., 2023a;b; Dubey et al., 2024), and the Gemma series (Team et al., 2024a;b), among others, have demonstrated impressive performance across various NLP tasks. However, the efficacy of LLMs has primarily been evaluated on English tasks, with their multilingual capabilities receiving less attention due to the models being predominantly pre-trained on English and the scarcity of multilingual data. Recently, there has been a shift towards multilingual studies in LLMs. For instance, LLaMA 3 and 3.1 (Dubey et al., 2024) expand the vocabulary from 32K to 128K and pre-train on multilingual texts; Üstün et al. (2024) have introduced Aya-101, a multilingual generative model supporting 101 languages; and BigTranslate (Yang et al., 2023) and LLaMAX (Lu et al., 2024) scale LLM-based multilingual translation models to over 100 languages. Despite the increased language support in LLMs, their performance across most languages falls short of practical application expectations, especially for mid- and low-resource languages.
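A minimal sketch of what a plug-and-play language-specific module layer can look like is given below: one lightweight adapter per language group sits on top of a frozen backbone layer, and only the adapter matching a sample's language group is applied, so training one group cannot interfere with another. The names (LanguageGroupAdapters, the group labels) and sizes are assumptions for illustration, not the X-ALMA code.

```python
import torch
import torch.nn as nn

class LanguageGroupAdapters(nn.Module):
    """Plug-and-play per-language-group adapters over a frozen backbone layer.

    Only the adapter for the sample's language group is applied, so gradients
    for one group never touch another group's parameters. (Illustrative sketch;
    module and group names are hypothetical.)
    """

    def __init__(self, hidden_size: int, groups: list, bottleneck: int = 64):
        super().__init__()
        self.adapters = nn.ModuleDict({
            g: nn.Sequential(
                nn.Linear(hidden_size, bottleneck),
                nn.ReLU(),
                nn.Linear(bottleneck, hidden_size),
            )
            for g in groups
        })

    def forward(self, hidden: torch.Tensor, group: str) -> torch.Tensor:
        # Residual adapter: backbone output plus the selected group's correction.
        return hidden + self.adapters[group](hidden)

# Example: route a German batch through the Germanic-group adapter only.
layer = LanguageGroupAdapters(hidden_size=512, groups=["germanic", "slavic", "cjk"])
h = torch.randn(2, 16, 512)
out = layer(h, group="germanic")
```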
Improving Multilingual Translation by Representation and Gradient Regularization
Yang, Yilin, Eriguchi, Akiko, Muzio, Alexandre, Tadepalli, Prasad, Lee, Stefan, Hassan, Hany
Multilingual Neural Machine Translation (NMT) enables one model to serve all translation directions, including ones that are unseen during training, i.e., zero-shot translation. Despite being theoretically attractive, current models often produce low-quality translations -- commonly failing to even produce outputs in the right target language. In this work, we observe that off-target translation is dominant even in strong multilingual systems trained on massive multilingual corpora. To address this issue, we propose a joint approach to regularize NMT models at both the representation level and the gradient level. At the representation level, we leverage an auxiliary target language prediction task to regularize decoder outputs to retain information about the target language. At the gradient level, we leverage a small amount of direct data (in thousands of sentence pairs) to regularize model gradients. Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance, by +5.59 and +10.38 BLEU on the WMT and OPUS datasets, respectively. Moreover, experiments show that our method also works well when the small amount of direct data is not available.
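The representation-level regularizer can be sketched as an auxiliary target-language classifier over decoder states whose loss is added to the translation loss, so decoder outputs keep information about the intended target language. The pooling choice, the loss weight alpha, and the names below are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetLanguagePredictor(nn.Module):
    """Auxiliary head predicting the target-language ID from decoder states."""

    def __init__(self, hidden_size: int, num_languages: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_languages)

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        # Mean-pool over time, then classify (pooling choice is an assumption).
        pooled = decoder_states.mean(dim=1)          # (batch, hidden)
        return self.classifier(pooled)               # (batch, num_languages)

def joint_loss(nmt_loss, decoder_states, tgt_lang_ids, lang_head, alpha=0.1):
    """Translation loss plus the language-prediction regularizer."""
    lang_logits = lang_head(decoder_states)
    lang_loss = F.cross_entropy(lang_logits, tgt_lang_ids)
    return nmt_loss + alpha * lang_loss
```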
Neural Text Generation with Artificial Negative Examples
Shirai, Keisuke, Hashimoto, Kazuma, Eriguchi, Akiko, Ninomiya, Takashi, Mori, Shinsuke
Neural text generation models that condition on a given input (e.g., machine translation and image captioning) are usually trained by maximum likelihood estimation of the target text. However, the trained models suffer from various types of errors at inference time. In this paper, we propose to suppress an arbitrary type of error by training the text generation model in a reinforcement learning framework, where we use a trainable reward function that is capable of discriminating between references and sentences containing the targeted type of error. We create such negative examples by artificially injecting the targeted errors into the references. In experiments, we focus on two error types, repeated and dropped tokens, in model-generated text. The experimental results show that our method can suppress the generation errors and achieve significant improvements on two machine translation and two image captioning tasks.
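The negative-example construction is easy to illustrate: starting from a tokenized reference, randomly repeat or drop tokens to produce a sentence exhibiting exactly the targeted error type, which the reward function then learns to distinguish from the clean reference. The corruption rate and function name below are assumptions for illustration.

```python
import random

def inject_errors(reference, error_type, p=0.1, seed=0):
    """Build an artificial negative example from a tokenized reference.

    error_type "repeat" duplicates random tokens; "drop" deletes them.
    (Illustrative sketch; the corruption rate and names are assumptions.)
    """
    rng = random.Random(seed)
    corrupted = []
    for tok in reference:
        if error_type == "drop" and rng.random() < p:
            continue                      # dropped-token error
        corrupted.append(tok)
        if error_type == "repeat" and rng.random() < p:
            corrupted.append(tok)         # repeated-token error
    return corrupted

# Example: a "repeat" negative paired with its clean reference for reward training.
print(inject_errors("the cat sat on the mat".split(), "repeat", p=0.3))
```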
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
Shen, Jonathan, Nguyen, Patrick, Wu, Yonghui, Chen, Zhifeng, Chen, Mia X., Jia, Ye, Kannan, Anjuli, Sainath, Tara, Cao, Yuan, Chiu, Chung-Cheng, He, Yanzhang, Chorowski, Jan, Hinsu, Smit, Laurenzo, Stella, Qin, James, Firat, Orhan, Macherey, Wolfgang, Gupta, Suyog, Bapna, Ankur, Zhang, Shuyuan, Pang, Ruoming, Weiss, Ron J., Prabhavalkar, Rohit, Liang, Qiao, Jacob, Benoit, Liang, Bowen, Lee, HyoukJoong, Chelba, Ciprian, Jean, Sébastien, Li, Bo, Johnson, Melvin, Anil, Rohan, Tibrewal, Rajat, Liu, Xiaobing, Eriguchi, Akiko, Jaitly, Navdeep, Ari, Naveen, Cherry, Colin, Haghani, Parisa, Good, Otavio, Cheng, Youlong, Alvarez, Raziel, Caswell, Isaac, Hsu, Wei-Ning, Yang, Zongheng, Wang, Kuan-Chieh, Gonina, Ekaterina, Tomanek, Katrin, Vanik, Ben, Wu, Zelin, Jones, Llion, Schuster, Mike, Huang, Yanping, Chen, Dehao, Irie, Kazuki, Foster, George, Richardson, John, Macherey, Klaus, Bruguier, Antoine, Zen, Heiga, Raffel, Colin, Kumar, Shankar, Rao, Kanishka, Rybach, David, Murray, Matthew, Peddinti, Vijayaditya, Krikun, Maxim, Bacchiani, Michiel A. U., Jablin, Thomas B., Suderman, Rob, Williams, Ian, Lee, Benjamin, Bhatia, Deepti, Carlson, Justin, Yavuz, Semih, Zhang, Yu, McGraw, Ian, Galkin, Max, Ge, Qi, Pundak, Golan, Whipkey, Chad, Wang, Todd, Alon, Uri, Lepikhin, Dmitry, Tian, Ye, Sabour, Sara, Chan, William, Toshniwal, Shubham, Liao, Baohua, Nirschl, Michael, Rondon, Pat
Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus on sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. Lingvo has been used in collaboration by dozens of researchers in more than 20 papers over the last two years. This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase the capabilities of the framework.
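The design idea of modular building blocks driven by centralized, overridable experiment configurations can be illustrated generically, as in the Python sketch below; this is not Lingvo's actual Params API, only a minimal stand-in for the pattern.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class EncoderConfig:
    """Centralized, immutable experiment configuration (generic sketch,
    not Lingvo's actual configuration API)."""
    num_layers: int = 6
    hidden_size: int = 512
    dropout: float = 0.1

# A base experiment defines defaults once; variants override individual fields.
BASE = EncoderConfig()
BIG = replace(BASE, num_layers=12, hidden_size=1024)

def build_encoder(cfg: EncoderConfig):
    # Each building block is constructed purely from its config object,
    # keeping experiments reproducible and easy to diff.
    print(f"Encoder({cfg.num_layers} layers, d={cfg.hidden_size}, p={cfg.dropout})")

build_encoder(BIG)
```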
Zero-Shot Cross-lingual Classification Using Multilingual Neural Machine Translation
Eriguchi, Akiko, Johnson, Melvin, Firat, Orhan, Kazawa, Hideto, Macherey, Wolfgang
Transferring representations from large supervised tasks to downstream tasks has shown promising results in AI fields such as Computer Vision and Natural Language Processing (NLP). In parallel, the recent progress in Machine Translation (MT) has enabled one to train multilingual Neural MT (NMT) systems that can translate between multiple languages and are also capable of performing zero-shot translation. However, little attention has been paid to leveraging representations learned by a multilingual NMT system to enable zero-shot multilinguality in other NLP tasks. In this paper, we demonstrate a simple framework, a multilingual Encoder-Classifier, for cross-lingual transfer learning by reusing the encoder from a multilingual NMT system and stitching it with a task-specific classifier component. Our proposed model achieves significant improvements in the English setup on three benchmark tasks - Amazon Reviews, SST and SNLI. Further, our system can perform classification in a new language for which no classification data was seen during training, showing that zero-shot classification is possible and remarkably competitive. In order to understand the underlying factors contributing to this finding, we conducted a series of analyses on the effect of the shared vocabulary, the training data type for NMT, classifier complexity, encoder representation power, and model generalization on zero-shot performance. Our results provide strong evidence that the representations learned from multilingual NMT systems are widely applicable across languages and tasks.
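The Encoder-Classifier framework amounts to reusing the multilingual NMT encoder's representations and stitching a small task-specific head on top. Below is a minimal PyTorch sketch under that reading; the pooling choice and names (EncoderClassifier, head) are assumptions, and any module returning per-token states of the right width can stand in for the NMT encoder.

```python
import torch
import torch.nn as nn

class EncoderClassifier(nn.Module):
    """Reuse a pre-trained multilingual NMT encoder and stitch on a task head.

    (Illustrative sketch: `nmt_encoder` stands in for any module returning
    per-token states of size `hidden_size`; pooling choice is an assumption.)
    """

    def __init__(self, nmt_encoder: nn.Module, hidden_size: int, num_classes: int):
        super().__init__()
        self.encoder = nmt_encoder
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, src_tokens: torch.Tensor) -> torch.Tensor:
        states = self.encoder(src_tokens)        # (batch, seq, hidden)
        pooled = states.mean(dim=1)              # simple mean pooling
        return self.head(pooled)                 # (batch, num_classes)

# Example with a stand-in encoder (an embedding layer) for shape checking.
dummy = nn.Sequential(nn.Embedding(1000, 64))
model = EncoderClassifier(dummy, hidden_size=64, num_classes=3)
logits = model(torch.randint(0, 1000, (2, 7)))   # (2, 3)
```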