Machine Translation
Don't Throw Away Data: Better Sequence Knowledge Distillation
Wang, Jun, Briakou, Eleftheria, Dadkhahi, Hamid, Agarwal, Rishabh, Cherry, Colin, Cohn, Trevor
A critical component in knowledge distillation is the means of coupling the teacher and student. The predominant sequence knowledge distillation method involves supervised learning of the student against teacher-decoded outputs, and is exemplified by the current state of the art, which incorporates minimum Bayes risk (MBR) decoding. In this paper we seek to integrate MBR more tightly in distillation training, specifically by using several high scoring MBR translations, rather than a single selected sequence, thus capturing a rich diversity of teacher outputs. Our experiments on English to German and English to Japanese translation show consistent improvements over strong baseline methods for both tasks and with varying model sizes. Additionally, we conduct a detailed analysis focusing on data efficiency and capacity curse aspects to elucidate MBR-n and explore its further potential.
Word Order in English-Japanese Simultaneous Interpretation: Analyses and Evaluation using Chunk-wise Monotonic Translation
Doi, Kosuke, Ko, Yuka, Makinae, Mana, Sudoh, Katsuhito, Nakamura, Satoshi
This paper analyzes the features of monotonic translations, which follow the word order of the source language, in simultaneous interpreting (SI). Word order differences are one of the biggest challenges in SI, especially for language pairs with significant structural differences like English and Japanese. We analyzed the characteristics of chunk-wise monotonic translation (CMT) sentences using the NAIST English-to-Japanese Chunk-wise Monotonic Translation Evaluation Dataset and identified some grammatical structures that make monotonic translation difficult in English-Japanese SI. We further investigated the features of CMT sentences by evaluating the output from the existing speech translation (ST) and simultaneous speech translation (simulST) models on the NAIST English-to-Japanese Chunk-wise Monotonic Translation Evaluation Dataset as well as on existing test sets. The results indicate the possibility that the existing SI-based test set underestimates the model performance. The results also suggest that using CMT sentences as references gives higher scores to simulST models than ST models, and that using an offline-based test set to evaluate the simulST models underestimates the model performance.
Mitigating Translationese in Low-resource Languages: The Storyboard Approach
Kuwanto, Garry, Urua, Eno-Abasi E., Amuok, Priscilla Amondi, Muhammad, Shamsuddeen Hassan, Aremu, Anuoluwapo, Otiende, Verrah, Nanyanga, Loice Emma, Nyoike, Teresiah W., Akpan, Aniefon D., Udouboh, Nsima Ab, Archibong, Idongesit Udeme, Moses, Idara Effiong, Ige, Ifeoluwatayo A., Ajibade, Benjamin, Awokoya, Olumide Benjamin, Abdulmumin, Idris, Aliyu, Saminu Mohammad, Iro, Ruqayya Nasir, Ahmad, Ibrahim Said, Smith, Deontae, Michaels, Praise-EL, Adelani, David Ifeoluwa, Wijaya, Derry Tanti, Andy, Anietie
Low-resource languages often face challenges in acquiring high-quality language data due to the reliance on translation-based methods, which can introduce the translationese effect. This phenomenon results in translated sentences that lack fluency and naturalness in the target language. In this paper, we propose a novel approach for data collection by leveraging storyboards to elicit more fluent and natural sentences. Our method involves presenting native speakers with visual stimuli in the form of storyboards and collecting their descriptions without direct exposure to the source text. We conducted a comprehensive evaluation comparing our storyboard-based approach with traditional text translation-based methods in terms of accuracy and fluency. Human annotators and quantitative metrics were used to assess translation quality. The results indicate a preference for text translation in terms of accuracy, while our method demonstrates worse accuracy but better fluency in the language focused.
Towards Systematic Monolingual NLP Surveys: GenA of Greek NLP
Bakagianni, Juli, Pouli, Kanella, Gavriilidou, Maria, Pavlopoulos, John
Natural Language Processing (NLP) research has traditionally been predominantly focused on English, driven by the availability of resources, the size of the research community, and market demands. Recently, there has been a noticeable shift towards multilingualism in NLP, recognizing the need for inclusivity and effectiveness across diverse languages and cultures. Monolingual surveys have the potential to complement the broader trend towards multilingualism in NLP by providing foundational insights and resources necessary for effectively addressing the linguistic diversity of global communication. However, monolingual NLP surveys are extremely rare in literature. This study fills the gap by introducing a method for creating systematic and comprehensive monolingual NLP surveys. Characterized by a structured search protocol, it can be used to select publications and organize them through a taxonomy of NLP tasks. We include a classification of Language Resources (LRs), according to their availability, and datasets, according to their annotation, to highlight publicly-available and machine-actionable LRs. By applying our method, we conducted a systematic literature review of Greek NLP from 2012 to 2022, providing a comprehensive overview of the current state and challenges of Greek NLP research. We discuss the progress of Greek NLP and outline encountered Greek LRs, classified by availability and usability. As we show, our proposed method helps avoid common pitfalls, such as data leakage and contamination, and to assess language support per NLP task. We consider this systematic literature review of Greek NLP an application of our method that showcases the benefits of a monolingual NLP survey. Similar applications could be regard the myriads of languages whose progress in NLP lags behind that of well-supported languages.
Setting up the Data Printer with Improved English to Ukrainian Machine Translation
Paniv, Yurii, Chaplynskyi, Dmytro, Trynus, Nikita, Kyrylov, Volodymyr
To build large language models for Ukrainian we need to expand our corpora with large amounts of new algorithmic tasks expressed in natural language. Examples of task performance expressed in English are abundant, so with a high-quality translation system our community will be enabled to curate datasets faster. To aid this goal, we introduce a recipe to build a translation system using supervised finetuning of a large pretrained language model with a noisy parallel dataset of 3M pairs of Ukrainian and English sentences followed by a second phase of training using 17K examples selected by k-fold perplexity filtering on another dataset of higher quality. Our decoder-only model named Dragoman beats performance of previous state of the art encoder-decoder models on the FLORES devtest set.
Towards Chapter-to-Chapter Context-Aware Literary Translation via Large Language Models
Jin, Linghao, An, Li, Ma, Xuezhe
Discourse phenomena in existing document-level translation datasets are sparse, which has been a fundamental obstacle in the development of context-aware machine translation models. Moreover, most existing document-level corpora and context-aware machine translation methods rely on an unrealistic assumption on sentence-level alignments. To mitigate these issues, we first curate a novel dataset of Chinese-English literature, which consists of 160 books with intricate discourse structures. Then, we propose a more pragmatic and challenging setting for context-aware translation, termed chapter-to-chapter (Ch2Ch) translation, and investigate the performance of commonly-used machine translation models under this setting. Furthermore, we introduce a potential approach of finetuning large language models (LLMs) within the domain of Ch2Ch literary translation, yielding impressive improvements over baselines. Through our comprehensive analysis, we unveil that literary translation under the Ch2Ch setting is challenging in nature, with respect to both model learning methods and translation decoding algorithms.
Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing
Fayyazsanavi, Pooya, Anastasopoulos, Antonios, Košecká, Jana
Sign language translation from video to spoken text presents unique challenges owing to the distinct grammar, expression nuances, and high variation of visual appearance across different speakers and contexts. The intermediate gloss annotations of videos aim to guide the translation process. In our work, we focus on Gloss2Text translation stage and propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and novel label-smoothing loss function exploiting gloss translation ambiguities Figure 1: An example of ambiguity in sign language is improving significantly the performance of demonstrated by the gloss "BEWOELKT (CLOUDY)," state-of-the-art approaches. Through extensive which is represented in multiple translations within the experiments and ablation studies on the dataset. As shown, ambiguity may share the same meaning PHOENIX Weather 2014T dataset, our approach but differ in form, such as "wolken (cloudy)," or surpasses state-of-the-art performance where the gloss represents the concept meaning, such in Gloss2Text translation, indicating its efficacy as "unbeständig (unstable)." in addressing sign language translation and suggesting promising avenues for future research and development.
ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs
Heakl, Ahmed, Zaghloul, Youssef, Ali, Mennatullah, Hossam, Rania, Gomaa, Walid
Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. In the field of ASR, we explore the utilization of the Whisper model for code-switched Egyptian Arabic recognition, detailing our experimental procedures including data preprocessing and training techniques. Through the implementation of a consecutive speech-to-text translation system that integrates ASR with MT, we aim to overcome challenges posed by limited resources and the unique characteristics of the Egyptian Arabic dialect. Evaluation against established metrics showcases promising results, with our methodologies yielding a significant improvement of $56\%$ in English translation over the state-of-the-art and $9.3\%$ in Arabic translation. Since code-switching is deeply inherent in spoken languages, it is crucial that ASR systems can effectively handle this phenomenon. This capability is crucial for enabling seamless interaction in various domains, including business negotiations, cultural exchanges, and academic discourse. Our models and code are available as open-source resources. Code: \url{http://github.com/ahmedheakl/arazn-llm}}, Models: \url{http://huggingface.co/collections/ahmedheakl/arazn-llm-662ceaf12777656607b9524e}.
Toucan: Many-to-Many Translation for 150 African Language Pairs
Elmadany, AbdelRahim, Adebara, Ife, Abdul-Mageed, Muhammad
We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, we introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters respectively. Next, we finetune the aforementioned models to create toucan, an Afrocentric machine translation model designed to support 156 African language pairs. To evaluate Toucan, we carefully develop an extensive machine translation benchmark, dubbed AfroLingu-MT, tailored for evaluating machine translation. Toucan significantly outperforms other models, showcasing its remarkable performance on MT for African languages. Finally, we train a new model, spBLEU-1K, to enhance translation evaluation metrics, covering 1K languages, including 614 African languages. This work aims to advance the field of NLP, fostering cross-cultural understanding and knowledge exchange, particularly in regions with limited language resources such as Africa. The GitHub repository for the Toucan project is available at https://github.com/UBC-NLP/Toucan.
Rule-Based, Neural and LLM Back-Translation: Comparative Insights from a Variant of Ladin
Frontull, Samuel, Moser, Georg
This paper explores the impact of different back-translation approaches on machine translation for Ladin, specifically the Val Badia variant. Given the limited amount of parallel data available for this language (only 18k Ladin-Italian sentence pairs), we investigate the performance of a multilingual neural machine translation model fine-tuned for Ladin-Italian. In addition to the available authentic data, we synthesise further translations by using three different models: a fine-tuned neural model, a rule-based system developed specifically for this language pair, and a large language model. Our experiments show that all approaches achieve comparable translation quality in this low-resource scenario, yet round-trip translations highlight differences in model performance.