
Collaborating Authors

 Murray, Kenton


DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models

arXiv.org Artificial Intelligence

Most of the world's languages and dialects are low-resource and lack support in mainstream machine translation (MT) models. However, many of them have a closely related high-resource language (HRL) neighbor and differ from it in linguistically regular ways. This underscores the importance of model robustness to dialectal variation and of cross-lingual generalization to the HRL's dialect continuum. We present DialUp, consisting of a training-time technique for adapting a pretrained model to dialectal data (M->D) and an inference-time intervention adapting dialectal data to the model's expertise (D->M). M->D induces model robustness to potentially unseen and unknown dialects through exposure to synthetic data exemplifying linguistic mechanisms of dialectal variation, whereas D->M treats dialectal divergence for known target dialects. These methods show considerable performance gains for several dialects from four language families, and modest gains for two other language families. We also conduct feature and error analyses, which show that language varieties with low baseline MT performance are more likely to benefit from these approaches.
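
As a rough illustration of the M->D direction (not DialUp's actual rules), the sketch below perturbs HRL sentences with a small hypothetical dialect lexicon and light orthographic noise to produce synthetic dialect-like training data; the lexicon, probabilities, and perturbation types are all assumptions.

```python
# Minimal sketch of the M->D idea: augment high-resource-language (HRL)
# training sentences with synthetic, dialect-like perturbations so the MT
# model sees plausible dialectal variation. The lexicon and rules below are
# hypothetical placeholders, not DialUp's actual perturbations.
import random

# Hypothetical lexical correspondences (HRL word -> dialect-like variant).
DIALECT_LEXICON = {
    "house": "haus",
    "water": "wata",
    "going": "goin",
}

def perturb_sentence(sentence: str, lex_prob: float = 0.5, char_prob: float = 0.1) -> str:
    """Apply lexical swaps and light character-level noise to one sentence."""
    words = []
    for word in sentence.split():
        lower = word.lower()
        # Lexical variation: swap in a dialect-like variant with some probability.
        if lower in DIALECT_LEXICON and random.random() < lex_prob:
            word = DIALECT_LEXICON[lower]
        # Orthographic variation: occasionally drop a final character,
        # mimicking regular sound/spelling correspondences.
        if len(word) > 3 and random.random() < char_prob:
            word = word[:-1]
        words.append(word)
    return " ".join(words)

if __name__ == "__main__":
    random.seed(0)
    src = "The water near the house is going down"
    print(perturb_sentence(src))  # e.g. "The wata near the haus is goin down"
```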


Merging Feed-Forward Sublayers for Compressed Transformers

arXiv.org Artificial Intelligence

With the rise and ubiquity of larger deep learning models, the need for high-quality compression techniques is growing in order to deploy these models widely. The sheer parameter count of these models makes it difficult to fit them into the memory constraints of different hardware. In this work, we present a novel approach to model compression by merging similar parameter groups within a model, rather than pruning away less important parameters. Specifically, we select, align, and merge separate feed-forward sublayers in Transformer models, and test our method on language modeling, image classification, and machine translation. With our method, we demonstrate performance comparable to the original models while combining more than a third of model feed-forward sublayers, and demonstrate improved performance over a strong layer-pruning baseline. For instance, we can remove over 21% of total parameters from a Vision Transformer, while maintaining 99% of its original performance. Additionally, we observe that some groups of feed-forward sublayers exhibit high activation similarity, which may help explain their surprising mergeability.
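
The sketch below illustrates the select-align-merge idea on a single pair of feed-forward sublayers: the hidden units of one sublayer are matched to the other's via the Hungarian algorithm on a cosine-similarity matrix, permuted, and averaged. The shapes and similarity criterion are illustrative assumptions rather than the paper's exact procedure.

```python
# Sketch of merging two Transformer feed-forward sublayers: find a permutation
# of the second sublayer's hidden units that best matches the first (cosine
# similarity of up-projection rows + Hungarian algorithm), permute, and average.
import numpy as np
from scipy.optimize import linear_sum_assignment

def merge_ffn(w1_up, w1_down, w2_up, w2_down):
    """w*_up: (d_ff, d_model) up-projections; w*_down: (d_model, d_ff) down-projections."""
    # Cosine similarity between the hidden units of the two sublayers.
    a = w1_up / np.linalg.norm(w1_up, axis=1, keepdims=True)
    b = w2_up / np.linalg.norm(w2_up, axis=1, keepdims=True)
    sim = a @ b.T                                   # (d_ff, d_ff)
    row, col = linear_sum_assignment(-sim)          # maximize total similarity
    # Permute sublayer 2's hidden units to line up with sublayer 1, then average.
    merged_up = 0.5 * (w1_up + w2_up[col])
    merged_down = 0.5 * (w1_down + w2_down[:, col])
    return merged_up, merged_down

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_ff = 8, 32
    w_up, w_down = rng.normal(size=(d_ff, d_model)), rng.normal(size=(d_model, d_ff))
    perm = rng.permutation(d_ff)
    # A permuted copy of the same sublayer should merge back losslessly.
    m_up, m_down = merge_ffn(w_up, w_down, w_up[perm], w_down[:, perm])
    print(np.allclose(m_up, w_up), np.allclose(m_down, w_down))
```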


Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

arXiv.org Machine Learning

Data availability across domains often follows a long-tail distribution: a few domains have abundant data, while most face data scarcity. This imbalance poses challenges in training language models uniformly across all domains. In our study, we focus on multilingual settings, where data sizes vary significantly between high- and low-resource languages. Common strategies to address this include upsampling low-resource languages (Temperature Sampling) or upweighting their loss (Scalarization). Although the two are often considered equivalent, this assumption has not been proven, which motivates our study. Through both theoretical and empirical analysis, we identify the conditions under which these approaches are equivalent and when they diverge. Specifically, we demonstrate that the two methods are equivalent under full gradient descent, but that this equivalence breaks down with stochastic gradient descent. Empirically, we observe that Temperature Sampling converges more quickly but is prone to overfitting. We argue that this faster convergence is likely due to the lower variance in gradient estimates, as shown theoretically. Based on these insights, we propose Cooldown, a strategy that reduces the sampling temperature during training, accelerating convergence without overfitting to low-resource languages. Our method is competitive with existing data re-weighting approaches while offering computational efficiency.
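
For intuition, the sketch below computes temperature-sampling probabilities (p_i proportional to n_i^(1/T)) and decays the temperature toward 1 over training in the spirit of Cooldown; the linear schedule and its endpoints are assumptions, not the paper's exact settings.

```python
# Sketch of temperature sampling and a cooldown-style schedule: sampling
# probabilities are proportional to corpus sizes raised to 1/T, and the
# temperature is reduced toward 1 as training progresses. The linear decay
# and the endpoint values are illustrative assumptions.
import numpy as np

def temperature_probs(sizes, temperature):
    """p_i proportional to n_i ** (1 / T); T=1 is proportional sampling, large T is near-uniform."""
    scaled = np.asarray(sizes, dtype=float) ** (1.0 / temperature)
    return scaled / scaled.sum()

def cooldown_temperature(step, total_steps, t_start=5.0, t_end=1.0):
    """Linearly decay the sampling temperature from t_start to t_end."""
    frac = min(step / total_steps, 1.0)
    return t_start + frac * (t_end - t_start)

if __name__ == "__main__":
    corpus_sizes = [1_000_000, 50_000, 5_000]   # high-, mid-, low-resource
    for step in (0, 5_000, 10_000):
        t = cooldown_temperature(step, total_steps=10_000)
        print(f"step={step:>6} T={t:.1f} probs={temperature_probs(corpus_sizes, t).round(3)}")
```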


Findings of the IWSLT 2024 Evaluation Campaign

arXiv.org Artificial Intelligence

This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address seven scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attracted 18 teams, whose submissions are documented in 26 system papers. The growing interest in spoken language translation is also reflected in the steadily increasing number of shared task organizers and contributors to the overview paper, split almost evenly between industry and academia.


MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval

arXiv.org Artificial Intelligence

Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge. However, existing video retrieval datasets suffer from scope limitations, primarily focusing on matching descriptive but vague queries with small collections of professionally edited, English-centric videos. To address this gap, we introduce MultiVENT 2.0, a large-scale, multilingual, event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. These queries specifically target information found in the visual content, audio, embedded text, and text metadata of the videos, requiring systems to leverage all of these sources to succeed at the task. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task, and while alternative approaches show promise, they are still insufficient to address this problem adequately. These findings underscore the need for more robust multimodal retrieval systems, as effective video retrieval is a crucial step towards multimodal content understanding and generation tasks.
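
As a purely hypothetical baseline (not part of the MultiVENT 2.0 release), one way to leverage all evidence sources is late fusion of per-modality retrieval scores; the modality weights and score inputs below are assumptions.

```python
# Hypothetical late-fusion baseline for multi-source video retrieval: combine
# per-modality relevance scores (visual, audio/ASR, OCR text, metadata) into a
# single ranking. Weights and score dictionaries are illustrative only.
from typing import Dict

MODALITY_WEIGHTS = {"visual": 0.4, "asr": 0.25, "ocr": 0.2, "metadata": 0.15}

def fuse_scores(per_modality: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """per_modality[mod][video_id] -> score; returns weighted-sum fused scores."""
    fused: Dict[str, float] = {}
    for mod, weight in MODALITY_WEIGHTS.items():
        for video_id, score in per_modality.get(mod, {}).items():
            fused[video_id] = fused.get(video_id, 0.0) + weight * score
    return fused

if __name__ == "__main__":
    scores = {
        "visual":   {"vid1": 0.9, "vid2": 0.2},
        "asr":      {"vid1": 0.1, "vid2": 0.8},
        "ocr":      {"vid2": 0.7},
        "metadata": {"vid1": 0.5},
    }
    ranked = sorted(fuse_scores(scores).items(), key=lambda kv: kv[1], reverse=True)
    print(ranked)
```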


X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale

arXiv.org Artificial Intelligence

Large language models (LLMs) have achieved remarkable success across various NLP tasks, yet their focus has predominantly been on English due to English-centric pre-training and limited multilingual data. While some multilingual LLMs claim to support hundreds of languages, these models often fail to provide high-quality responses for mid- and low-resource languages, leading to performance heavily skewed in favor of high-resource languages like English and Chinese. We prioritize quality over scaling the number of languages, with a focus on the multilingual machine translation task, and introduce X-ALMA, a model designed to ensure top-tier performance across 50 diverse languages, regardless of their resource levels. This is achieved with a plug-and-play language-specific module architecture that prevents language conflicts during training, and a carefully designed training regimen with novel optimization methods that maximize translation performance. In the final stage of this regimen, our proposed Adaptive-Rejection Preference Optimization (ARPO) surpasses existing preference optimization methods on translation tasks. Large language models (LLMs) such as the GPT series (Brown et al., 2020; OpenAI, 2023), Mistral (Jiang et al., 2023), the LLaMA series (Touvron et al., 2023a;b; Dubey et al., 2024), and the Gemma series (Team et al., 2024a;b), among others, have demonstrated impressive performance across various NLP tasks. However, the efficacy of LLMs has primarily been evaluated on English tasks, with their multilingual capabilities receiving less attention because the models are predominantly pre-trained on English and multilingual data is scarce. Recently, there has been a shift towards multilingual studies of LLMs. For instance, LLaMA 3 and 3.1 (Dubey et al., 2024) expand the vocabulary from 32K to 128K tokens and pre-train on multilingual text; Üstün et al. (2024) introduced Aya-101, a multilingual generative model supporting 101 languages; and BigTranslate (Yang et al., 2023) and LLaMAX (Lu et al., 2024) scale LLM-based multilingual translation models to over 100 languages. Despite this increased language support, LLM performance in most languages still falls short of practical application expectations, especially for mid- and low-resource languages.
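
A minimal sketch of the plug-and-play idea, assuming a shared backbone with one small adapter per language group and routing by the input language; the group assignments, adapter shape, and routing logic are illustrative, not X-ALMA's actual architecture.

```python
# Minimal PyTorch sketch of plug-and-play language-specific modules: a shared
# backbone plus one small adapter per language group; only the adapter matching
# the input language is applied. All names, shapes, and groupings are assumptions.
import torch
import torch.nn as nn

LANGUAGE_GROUPS = {"de": "germanic", "is": "germanic", "sw": "bantu", "zu": "bantu"}

class PlugAndPlayModel(nn.Module):
    def __init__(self, d_model: int = 64, d_adapter: int = 16):
        super().__init__()
        self.backbone = nn.Linear(d_model, d_model)      # stand-in for a shared LLM layer
        self.adapters = nn.ModuleDict({
            group: nn.Sequential(nn.Linear(d_model, d_adapter),
                                 nn.ReLU(),
                                 nn.Linear(d_adapter, d_model))
            for group in set(LANGUAGE_GROUPS.values())
        })

    def forward(self, hidden: torch.Tensor, lang: str) -> torch.Tensor:
        group = LANGUAGE_GROUPS[lang]                    # route to a single module
        return hidden + self.adapters[group](self.backbone(hidden))

if __name__ == "__main__":
    model = PlugAndPlayModel()
    x = torch.randn(2, 64)
    print(model(x, lang="sw").shape)   # torch.Size([2, 64])
```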


Faux Polyglot: A Study on Information Disparity in Multilingual Large Language Models

arXiv.org Artificial Intelligence

With Retrieval-Augmented Generation (RAG), Large Language Models (LLMs) are playing a pivotal role in information search and are being adopted globally. Although the multilingual capability of LLMs offers new opportunities to bridge the language barrier, do these capabilities translate into real-life scenarios where linguistic divides and knowledge conflicts between multilingual sources are known occurrences? In this paper, we studied LLMs' linguistic preferences in a RAG-based information search setting. We found that LLMs displayed a systemic bias towards information in the same language as the query language in both information retrieval and answer generation. Furthermore, in scenarios where there is little information in the language of the query, LLMs prefer documents in high-resource languages, reinforcing dominant views. Such bias exists for both factual and opinion-based queries. Our results highlight the linguistic divide within multilingual LLMs in information search systems. The seemingly beneficial multilingual capability of LLMs may backfire on information parity by reinforcing language-specific information cocoons or filter bubbles, further marginalizing low-resource views.
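
One simple (hypothetical) way to quantify the same-language preference described above is the share of top-k retrieved passages whose language matches the query language; the data and helper below are illustrative, not the paper's evaluation code.

```python
# Hypothetical sketch of measuring same-language preference in a RAG retriever:
# the fraction of top-k retrieved passages whose language tag matches the query
# language. The inputs here are illustrative placeholders.
from collections import Counter

def same_language_rate(query_lang: str, retrieved_langs: list, k: int = 5) -> float:
    """Share of the top-k retrieved passages written in the query language."""
    top_k = retrieved_langs[:k]
    return sum(lang == query_lang for lang in top_k) / len(top_k)

if __name__ == "__main__":
    # Languages of the passages a (hypothetical) retriever returned, in rank order.
    retrieved = ["hi", "hi", "en", "hi", "en", "fr"]
    print(same_language_rate("hi", retrieved))   # 0.6
    print(Counter(retrieved[:5]))                # language distribution over the top-5
```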


Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages

arXiv.org Artificial Intelligence

A majority of language technologies are tailored for a small number of high-resource languages, while many low-resource languages are neglected. One such group, Creole languages, has long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used across much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, comprising 14.5M unique Creole sentences with parallel translations -- 11.6M of which we release publicly -- and the largest bitexts gathered to date for 41 languages -- the first ever for 21 of them. In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity than ever before, which outperforms a genre-specific Creole MT model on its own benchmark for 26 of 34 translation directions.


Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

arXiv.org Artificial Intelligence

Moderate-sized large language models (LLMs) -- those with 7B or 13B parameters -- exhibit promising machine translation (MT) performance. However, even top-performing 13B LLM-based translation models, like ALMA, do not match the performance of state-of-the-art conventional encoder-decoder translation models or larger-scale LLMs such as GPT-4. In this study, we bridge this performance gap. We first assess the shortcomings of supervised fine-tuning (SFT) for LLMs on the MT task, emphasizing the quality issues present in the reference data despite being human-generated. Then, in contrast to SFT, which mimics reference translations, we introduce Contrastive Preference Optimization (CPO), a novel approach that trains models to avoid generating translations that are adequate but not perfect. Applying CPO to ALMA models with only 22K parallel sentences and 12M parameters yields significant improvements. The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4 on the WMT'21, WMT'22, and WMT'23 test datasets.
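
A minimal sketch of a contrastive preference-style objective, assuming summed sequence log-probabilities for a preferred and a dis-preferred translation: a sigmoid preference term plus a likelihood term on the preferred output. The beta value and exact composition are assumptions; see the paper for CPO's precise formulation.

```python
# Simplified sketch of a contrastive preference-style loss for translation:
# push the preferred translation's log-likelihood above the dis-preferred one's,
# and keep a likelihood term on the preferred output. Values are illustrative.
import torch
import torch.nn.functional as F

def contrastive_preference_loss(logp_preferred: torch.Tensor,
                                logp_dispreferred: torch.Tensor,
                                beta: float = 0.1) -> torch.Tensor:
    """logp_*: summed sequence log-probabilities under the current policy, shape (batch,)."""
    preference = -F.logsigmoid(beta * (logp_preferred - logp_dispreferred)).mean()
    nll = -logp_preferred.mean()          # keeps the model anchored to good references
    return preference + nll

if __name__ == "__main__":
    lp_w = torch.tensor([-12.0, -9.5], requires_grad=True)   # preferred translations
    lp_l = torch.tensor([-15.0, -8.0], requires_grad=True)   # dis-preferred translations
    loss = contrastive_preference_loss(lp_w, lp_l)
    loss.backward()
    print(float(loss), lp_w.grad)
```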


Narrowing the Gap between Zero- and Few-shot Machine Translation by Matching Styles

arXiv.org Artificial Intelligence

Large language models trained primarily in a monolingual setting have demonstrated an ability to generalize to machine translation using zero- and few-shot examples with in-context learning. However, even though zero-shot translations are relatively good, a discernible gap remains between their performance and that of the few-shot setting. In this paper, we investigate the factors contributing to this gap and find that it can largely be closed (by about 70%) by matching the writing style of the target corpus. Additionally, we explore potential approaches to enhance zero-shot baselines without the need for parallel demonstration examples, providing valuable insights into how these methods contribute to improving translation metrics.