Goto

Collaborating Authors

 Machine Translation


Responsible Multilingual Large Language Models: A Survey of Development, Applications, and Societal Impact

arXiv.org Artificial Intelligence

Multilingual Large Language Models (MLLMs) represent a pivotal advancement in democratizing artificial intelligence across linguistic boundaries. While theoretical foundations are well-established, practical implementation guidelines remain scattered. This work bridges this gap by providing a comprehensive end-to-end framework for developing and deploying MLLMs in production environments. We make three distinctive contributions: First, we present an actionable pipeline from data pre-processing through deployment, integrating insights from academic research and industrial applications. Second, using Llama2 as a case study, we provide detailed optimization strategies for enhancing multilingual capabilities, including curriculum learning approaches for balancing high-resource and low-resource languages, tokenization strategies, and effective sampling methods. Third, we offer an interdisciplinary analysis that considers technical, linguistic, and cultural perspectives in MLLM development. Our findings reveal critical challenges in supporting linguistic diversity, with 88.38% of world languages categorized as low-resource, affecting over a billion speakers. We examine practical solutions through real-world applications in customer service, search engines, and machine translation. By synthesizing theoretical frameworks with production-ready implementation strategies, this survey provides essential guidance for practitioners and researchers working to develop more inclusive and effective multilingual AI systems.


Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!

arXiv.org Artificial Intelligence

Humans possess multimodal literacy, allowing them to actively integrate information from various modalities to form reasoning. Faced with challenges like lexical ambiguity in text, we supplement this with other modalities, such as thumbnail images or textbook illustrations. Is it possible for machines to achieve a similar multimodal understanding capability? In response, we present Understanding Pun with Image Explanations (UNPIE), a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities. Puns serve as the ideal subject for this evaluation due to their intrinsic ambiguity. Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings. We pose three multimodal challenges with the annotations to assess different aspects of multimodal literacy; Pun Grounding, Disambiguation, and Reconstruction. The results indicate that various Socratic Models and Visual-Language Models improve over the text-only models when given visual context, particularly as the complexity of the tasks increases.


Context-Aware LLM Translation System Using Conversation Summarization and Dialogue History

arXiv.org Artificial Intelligence

Translating conversational text, particularly in customer support contexts, presents unique challenges due to its informal and unstructured nature. We propose a context-aware LLM translation system that leverages conversation summarization and dialogue history to enhance translation quality for the English-Korean language pair. Our approach incorporates the two most recent dialogues as raw data and a summary of earlier conversations to manage context length effectively. We demonstrate that this method significantly improves translation accuracy, maintaining coherence and consistency across conversations. This system offers a practical solution for customer support translation tasks, addressing the complexities of conversational text.


Subword Embedding from Bytes Gains Privacy without Sacrificing Accuracy and Complexity

arXiv.org Artificial Intelligence

While NLP models significantly impact our lives, there are rising concerns about privacy invasion. Although federated learning enhances privacy, attackers may recover private training data by exploiting model parameters and gradients. Therefore, protecting against such embedding attacks remains an open challenge. To address this, we propose Subword Embedding from Bytes (SEB) and encode subwords to byte sequences using deep neural networks, making input text recovery harder. Importantly, our method requires a smaller memory with $256$ bytes of vocabulary while keeping efficiency with the same input length. Thus, our solution outperforms conventional approaches by preserving privacy without sacrificing efficiency or accuracy. Our experiments show SEB can effectively protect against embedding-based attacks from recovering original sentences in federated learning. Meanwhile, we verify that SEB obtains comparable and even better results over standard subword embedding methods in machine translation, sentiment analysis, and language modeling with even lower time and space complexity.


On Creating an English-Thai Code-switched Machine Translation in Medical Domain

arXiv.org Artificial Intelligence

Machine translation (MT) in the medical domain plays a pivotal role in enhancing healthcare quality and disseminating medical knowledge. Despite advancements in English-Thai MT technology, common MT approaches often underperform in the medical field due to their inability to precisely translate medical terminologies. Our research prioritizes not merely improving translation accuracy but also maintaining medical terminology in English within the translated text through code-switched (CS) translation. We developed a method to produce CS medical translation data, fine-tuned a CS translation model with this data, and evaluated its performance against strong baselines, such as Google Neural Machine Translation (NMT) and GPT-3.5/GPT-4. Our model demonstrated competitive performance in automatic metrics and was highly favored in human preference evaluations. Our evaluation result also shows that medical professionals significantly prefer CS translations that maintain critical English terms accurately, even if it slightly compromises fluency. Our code and test set are publicly available https://github.com/preceptorai-org/NLLB_CS_EM_NLP2024.


CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation

arXiv.org Artificial Intelligence

Simultaneous speech translation (SimulST) systems must balance translation quality with response time, making latency measurement crucial for evaluating their real-world performance. However, there has been a longstanding belief that current metrics yield unrealistically high latency measurements in unsegmented streaming settings. In this paper, we investigate this phenomenon, revealing its root cause in a fundamental misconception underlying existing latency evaluation approaches. We demonstrate that this issue affects not only streaming but also segment-level latency evaluation across different metrics. Furthermore, we propose a modification to correctly measure computation-aware latency for SimulST systems, addressing the limitations present in existing metrics.


Efficient Terminology Integration for LLM-based Translation in Specialized Domains

arXiv.org Artificial Intelligence

Traditional machine translation methods typically involve training models directly on large parallel corpora, with limited emphasis on specialized terminology. However, In specialized fields such as patent, finance, or biomedical domains, terminology is crucial for translation, with many terms that needs to be translated following agreed-upon conventions. In this paper we introduce a methodology that efficiently trains models with a smaller amount of data while preserving the accuracy of terminology translation. We achieve this through a systematic process of term extraction and glossary creation using the Trie Tree algorithm, followed by data reconstruction to teach the LLM how to integrate these specialized terms. This methodology enhances the model's ability to handle specialized terminology and ensures high-quality translations, particularly in fields where term consistency is crucial. Our approach has demonstrated exceptional performance, achieving the highest translation score among participants in the WMT patent task to date, showcasing its effectiveness and broad applicability in specialized translation domains where general methods often fall short.


Analyzing Context Contributions in LLM-based Machine Translation

arXiv.org Artificial Intelligence

Large language models (LLMs) have achieved state-of-the-art performance in machine translation (MT) and demonstrated the ability to leverage in-context learning through few-shot examples. However, the mechanisms by which LLMs use different parts of the input context remain largely unexplored. In this work, we provide a comprehensive analysis of context utilization in MT, studying how LLMs use various context parts, such as few-shot examples and the source text, when generating translations. We highlight several key findings: (1) the source part of few-shot examples appears to contribute more than its corresponding targets, irrespective of translation direction; (2) finetuning LLMs with parallel data alters the contribution patterns of different context parts; and (3) there is a positional bias where earlier few-shot examples have higher contributions to the translated sequence. Finally, we demonstrate that inspecting anomalous context contributions can potentially uncover pathological translations, such as hallucinations. Our findings shed light on the internal workings of LLM-based MT which go beyond those known for standard encoder-decoder MT models.


Learning from others' mistakes: Finetuning machine translation models with span-level error annotations

arXiv.org Artificial Intelligence

L EARNING FROM OTHERS ' MISTAKES: F INETUNING MACHINE TRANSLATION MODELS WITH SPAN-LEVEL ERROR ANNOTATIONS Lily H. Zhang 2 Hamid Dadkhahi 1 Mara Finkelstein 1 Firas Trabelsi 1 Jiaming Luo 1 Markus Freitag 1 1 Google 2 New Y ork University A BSTRACT Despite growing interest in incorporating feedback to improve language models, most efforts focus only on sequence-level annotations. In this work, we explore the potential of utilizing fine-grained span-level annotations from offline datasets to improve model quality. We develop a simple finetuning algorithm, called Training with Annotations (TW A), to directly train machine translation models on such annotated data. TW A utilizes targeted span-level error information while also flexibly learning what to penalize within a span. Moreover, TW A considers the overall trajectory of a sequence when deciding which non-error spans to utilize as positive signals. Experiments on English-German and Chinese-English machine translation show that TW A outperforms baselines such as Supervised FineTuning on sequences filtered for quality and Direct Preference Optimization on pairs constructed from the same data. Such data, coupled with techniques to learn from it (Christiano et al., 2017; Rafailov et al., 2023; Gulcehre et al., 2023; Dong et al., 2023), have yielded impressive results for many top language models. Most efforts, however, consider only sequence-level labels, usually in the form of a scalar score assigned to the entire output. In contrast, this work investigates the potential of using fine-grained span-level annotations from offline datasets to enhance language model training.


Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5

arXiv.org Artificial Intelligence

Morphology is a crucial factor for multilingual language modeling as it poses direct challenges for tokenization. Here, we seek to understand how tokenization influences the morphological knowledge encoded in multilingual language models. Specifically, we capture the impact of tokenization by contrasting two multilingual language models: mT5 and ByT5. The two models share the same architecture, training objective, and training data and only differ in their tokenization strategies: subword tokenization vs.\@ character-level tokenization. Probing the morphological knowledge encoded in these models on four tasks and 17 languages, our analyses show that the models learn the morphological systems of some languages better than others and that morphological information is encoded in the middle and late layers. Finally, we show that languages with more irregularities benefit more from having a higher share of the pre-training data.