Machine Translation
Mitigating the Language Mismatch and Repetition Issues in LLM-based Machine Translation via Model Editing
Wang, Weichuan, Li, Zhaoyi, Lian, Defu, Ma, Chen, Song, Linqi, Wei, Ying
Large Language Models (LLMs) have recently revolutionized the NLP field, yet they still fall short in some specific downstream tasks. In this work, we focus on utilizing LLMs to perform machine translation, where we observe that two patterns of errors frequently occur and drastically affect the translation quality: language mismatch and repetition. The work sets out to explore the potential for mitigating these two issues by leveraging model editing methods, e.g., by locating Feed-Forward Network (FFN) neurons or other components that are responsible for the errors and deactivating them at inference time. We find that directly applying such methods either has limited effect on the targeted errors or has a significant negative side effect on the general translation quality, indicating that the located components may also be crucial for keeping machine translation with LLMs on the rails. To this end, we propose to refine the located components by taking the intersection of the locating results under different language settings, filtering out the aforementioned information that is irrelevant to the targeted errors. The experimental results empirically demonstrate that our methods can effectively reduce the language mismatch and repetition ratios while enhancing or maintaining the general translation quality in most cases.
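The refinement step described in this abstract, intersecting the components located under different language settings and then deactivating what remains, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `(layer, neuron)` index format, the language-pair keys, and the helper names `refine_located_neurons` and `deactivate` are hypothetical and are not taken from the paper.

```python
from functools import reduce


def refine_located_neurons(located_by_setting):
    """Keep only the neurons that were located in every language setting.

    `located_by_setting` maps a (hypothetical) language-setting name to a
    set of (layer, neuron_index) pairs produced by some locating method.
    """
    sets = [set(neurons) for neurons in located_by_setting.values()]
    if not sets:
        return set()
    return reduce(set.intersection, sets)


def deactivate(ffn_activations, layer, neurons_to_zero):
    """Return a copy of one layer's FFN activations with selected neurons zeroed."""
    return [
        0.0 if (layer, index) in neurons_to_zero else value
        for index, value in enumerate(ffn_activations)
    ]


# Made-up locating results for three language settings (illustrative only).
located = {
    "en-de": {(3, 17), (3, 42), (10, 5)},
    "en-zh": {(3, 17), (10, 5), (12, 8)},
    "de-en": {(3, 17), (10, 5), (20, 1)},
}
shared = refine_located_neurons(located)
```

In this toy input, only the neurons found under all three settings survive the intersection; neurons that appear in a single setting are treated as setting-specific and are left untouched.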
What do Large Language Models Need for Machine Translation Evaluation?
Qian, Shenbin, Sindhujan, Archchana, Kabra, Minnie, Kanojia, Diptesh, Orăsan, Constantin, Ranasinghe, Tharindu, Blain, Frédéric
Leveraging large language models (LLMs) for various natural language processing tasks has led to superlative claims about their performance. For the evaluation of machine translation (MT), existing research shows that LLMs are able to achieve results comparable to fine-tuned multilingual pre-trained language models. In this paper, we explore what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate MT quality. In addition, we investigate prompting techniques such as zero-shot, Chain of Thought (CoT) and few-shot prompting for eight language pairs covering high-, medium- and low-resource languages, leveraging varying LLM variants. Our findings indicate the importance of reference translations for an LLM-based evaluation. While larger models do not necessarily fare better, they tend to benefit more from CoT prompting than smaller models. We also observe that LLMs do not always provide a numerical score when generating evaluations, which raises a question about their reliability for the task. Our work presents a comprehensive analysis for resource-constrained and training-less LLM-based evaluation of machine translation. We release the accrued prompt templates, code and data publicly for reproducibility.
Reviews: Deliberation Networks: Sequence Generation Beyond One-Pass Decoding
Two of my major concerns, the weakness of the baseline and the lack of comparison with automatic post-editing, have been resolved by the response. I've raised my evaluation with the expectation that these results will be added to the final camera-ready version. With regard to the examples, the reason why I said "cherry-picked?" (with a question mark) was because there was no mention of how the examples were chosen. If they were chosen randomly or by some other unbiased method, that could be noted in the paper. It's OK to cherry-pick representative examples, of course, and it would be clearer if this was mentioned as well.
Reviews: Generative Neural Machine Translation
Summary This paper proposes a generative latent variable model for neural machine translation, where inference is performed with variational inference. This extends the work of Zhang et al., 2016, who proposed a conditional model with variational inference. The advantage of the generative model is to force the latent variable to capture more of the semantics of the sentence than the conditional model was able to do. The main disadvantage of this approach is that the value of the latent variable has to be inferred during decoding (based on candidate generations). The paper also shows that a version of this model can be trained in a multilingual setting, that monolingual data can be used as semi-supervised training, and that the inference algorithm can be extended to perform translation with missing words.
Reviews: One-Shot Imitation Learning
Summary --- Complex and useful robotic manipulation tasks are difficult because of the difficulty of manipulation itself, but also because it's difficult to communicate the intent of a task. Both of these problems can be alleviated through the use of imitation learning, but in order for this to be practical the learner must be able to generalize from few examples. This paper presents an architecture inspired by recent work in meta learning which generalizes manipulation of a robot arm from a single task demonstration; i.e., it does one-shot imitation learning. The network is something like a seq2seq model that uses multiple attention mechanisms in the style of "Neural Machine Translation by Jointly Learning to Align and Translate". There is a demonstration network, a context network and a manipulation network.
Optimizing the Training Schedule of Multilingual NMT using Reinforcement Learning
Allemann, Alexis, Atrio, Àlex R., Popescu-Belis, Andrei
Multilingual NMT is a viable solution for translating low-resource languages (LRLs) when data from high-resource languages (HRLs) from the same language family is available. However, the training schedule, i.e. the order of presentation of languages, has an impact on the quality of such systems. Here, in a many-to-one translation setting, we propose to apply two algorithms that use reinforcement learning to optimize the training schedule of NMT: (1) Teacher-Student Curriculum Learning and (2) Deep Q Network. The former uses an exponentially smoothed estimate of the returns of each action based on the loss on monolingual or multilingual development subsets, while the latter estimates rewards using an additional neural network trained from the history of actions selected in different states of the system, together with the rewards received. On an 8-to-1 translation dataset with LRLs and HRLs, our second method improves BLEU and COMET scores with respect to both random selection of monolingual batches and shuffled multilingual batches, by adjusting the number of presentations of LRL vs. HRL batches.
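The exponentially smoothed estimate of per-action returns mentioned in this abstract can be sketched as a small bandit-style scheduler. This is only an illustration under stated assumptions: the class name `SmoothedScheduler`, the epsilon-greedy selection rule, the default values of `alpha` and `epsilon`, and the idea of using a drop in development loss as the reward are all hypothetical choices, not details confirmed by the abstract.

```python
import random


class SmoothedScheduler:
    """Pick the next training language from smoothed per-language returns."""

    def __init__(self, languages, alpha=0.1, epsilon=0.1, seed=0):
        # One smoothed return estimate per action (here: per language).
        self.q = {lang: 0.0 for lang in languages}
        self.alpha = alpha      # smoothing factor of the moving average
        self.epsilon = epsilon  # probability of exploring a random language
        self.rng = random.Random(seed)

    def choose(self):
        """Epsilon-greedy choice over the current estimates."""
        candidates = sorted(self.q)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(candidates)
        return max(candidates, key=lambda lang: self.q[lang])

    def update(self, lang, reward):
        """Exponential smoothing: q <- q + alpha * (reward - q)."""
        self.q[lang] += self.alpha * (reward - self.q[lang])


# Usage sketch: the reward is assumed to be the decrease in development
# loss observed after training on one batch of the chosen language.
scheduler = SmoothedScheduler(["lrl_a", "hrl_b"], alpha=0.5, epsilon=0.0)
scheduler.update("hrl_b", reward=1.0)
next_language = scheduler.choose()
```

With `epsilon=0.0` the scheduler is purely greedy, so after the single update it selects the language whose estimate has risen; a non-zero `epsilon` keeps the other languages from being starved.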
Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content?
Qian, Shenbin, Orăsan, Constantin, Kanojia, Diptesh, Carmo, Félix do
This paper investigates whether large language models (LLMs) are state-of-the-art quality estimators for machine translation of user-generated content (UGC) that contains emotional expressions, without the use of reference translations. To achieve this, we employ an existing emotion-related dataset with human-annotated errors and calculate quality evaluation scores based on the Multi-dimensional Quality Metrics. We compare the accuracy of several LLMs with that of our fine-tuned baseline models, under in-context learning and parameter-efficient fine-tuning (PEFT) scenarios. We find that PEFT of LLMs leads to better performance in score prediction with human interpretable explanations than fine-tuned models. However, a manual analysis of LLM outputs reveals that they still have problems such as refusal to reply to a prompt and unstable output while evaluating machine translation of UGC.
Edit Distances and Their Applications to Downstream Tasks in Research and Commercial Contexts
Carmo, Félix do, Kanojia, Diptesh
Edit distances are a class of metrics used to quantify the similarity between two text sequences by calculating the minimum number of operations required to transform one sequence into another. These operations typically include insertion, deletion, substitution, and movement of characters or words. The application of edit distances extends beyond simple string comparison and is used extensively in evaluating machine-translated text against human references, quality estimation, and post-editing tasks. This tutorial is targeted at researchers of machine translation and of human translation, as well as corporate members of AMTA. It focuses on the uses of edit distances, such as TER - Translation Edit Rate (Snover et al., 2006), as proxies of translation effort and as informants of other downstream tasks, such as MT evaluation and post-editing, error annotation with MQM (Burchardt, 2013), quality estimation - QE (Specia et al., 2022) and automatic post-editing - APE (do Carmo et al., 2021). The application of edit distances in downstream tasks often assumes that these accurately represent work done by post-editors and real errors that need to be corrected in MT output. We will discuss how imperfect edit distances are in capturing the details of this error correction work and the implications for researchers and for commercial applications of these uses of edit distances. In terms of commercial applications, we will discuss their integration in computer-assisted translation tools and how the perception of the connection between edit distances and post-editor effort affects the definition of translator rates.
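As a concrete illustration of the kind of metric this tutorial discusses, the following is a minimal word-level edit distance (insertion, deletion, substitution) with a length-normalized rate. It is a simplified sketch, not TER itself: TER also allows block movements (shifts), which this version deliberately omits, and the whitespace tokenization and the function names `word_edit_distance` and `edit_rate` are assumptions made here.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn `hypothesis` into `reference` (no shift operation)."""
    hyp_words, ref_words = hypothesis.split(), reference.split()
    # prev[j] = distance between the hypothesis prefix seen so far
    # and the first j reference words.
    prev = list(range(len(ref_words) + 1))
    for i, hyp_word in enumerate(hyp_words, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref_words, start=1):
            substitution_cost = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,                      # delete a hypothesis word
                curr[j - 1] + 1,                  # insert a reference word
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def edit_rate(hypothesis, reference):
    """Edit distance divided by the reference length in words."""
    reference_length = len(reference.split())
    if reference_length == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis, reference) / reference_length


missing_word = word_edit_distance("the cat sat", "the cat sat down")
swapped_words = word_edit_distance("b a", "a b")
```

The swapped-word case shows one of the imperfections the abstract alludes to: without a movement operation, a single reordering that a post-editor might perform in one action is counted as two edits.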
Post-edits Are Preferences Too
Berger, Nathaniel, Riezler, Stefan, Exel, Miriam, Huck, Matthias
Preference Optimization (PO) techniques are currently among the state-of-the-art techniques for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreutzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences $s_1$ and $s_2$ and asked for a preference judgment, $s_1 > s_2$, while for post-editing, editors create $s_1$ and know that it should be better than $s_2$. We attempt to use these implicit preferences for PO and show that it helps the model move towards post-edit-like hypotheses and away from machine translation-like hypotheses. Furthermore, we show that best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit-like hypotheses to the top output ranks.
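The idea of treating a post-edit as the preferred sequence $s_1$ and the original machine translation as the dispreferred sequence $s_2$ can be sketched as follows. The record fields (`source`, `mt`, `post_edit`), the choice to skip unedited segments, and the use of a DPO-style loss as one example of a PO objective are assumptions made for illustration; the abstract does not state which PO objective or data format the authors use.

```python
import math


def build_preference_pairs(records):
    """Turn post-editing records into (prompt, chosen, rejected) triples.

    Segments the editor left unchanged carry no preference signal under
    this sketch's assumption, so they are skipped.
    """
    pairs = []
    for record in records:
        if record["post_edit"].strip() == record["mt"].strip():
            continue
        pairs.append({
            "prompt": record["source"],
            "chosen": record["post_edit"],  # s_1: created by the editor
            "rejected": record["mt"],       # s_2: the raw MT output
        })
    return pairs


def dpo_style_loss(policy_chosen_logp, policy_rejected_logp,
                   reference_chosen_logp, reference_rejected_logp, beta=0.1):
    """DPO-style loss for one pair: -log(sigmoid(margin)).

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model; `beta` is a hypothetical default.
    """
    margin = beta * (
        (policy_chosen_logp - reference_chosen_logp)
        - (policy_rejected_logp - reference_rejected_logp)
    )
    return math.log1p(math.exp(-margin))


# Made-up records for illustration only.
records = [
    {"source": "Guten Morgen", "mt": "Good morning", "post_edit": "Good morning"},
    {"source": "Wie geht's?", "mt": "How goes it?", "post_edit": "How are you?"},
]
pairs = build_preference_pairs(records)
```

The loss is smaller when the policy assigns relatively more probability to the post-edit than the reference model does, which matches the stated goal of moving towards post-edit-like hypotheses.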
Quantifying the Gaps Between Translation and Native Perception in Training for Multimodal, Multilingual Retrieval
Buettner, Kyle, Kovashka, Adriana
There is a scarcity of multilingual vision-language models that properly account for the perceptual differences that are reflected in image captions across languages and cultures. In this work, through a multimodal, multilingual retrieval case study, we quantify the existing lack of model flexibility. We empirically show performance gaps between training on captions that come from native German perception and captions that have been either machine-translated or human-translated from English into German. To address these gaps, we further propose and evaluate caption augmentation
Figure 1: Example perception differences between native English and German speakers. Examples are captions from Flickr30K (Young et al., 2014) and Multi30K (Elliott et al., 2016). Note differences in mentioned objects ("sand arena", "parasol") and specificity ("Heurigen bench" vs. "table", "horse" vs. "bronco"). German captions here are translated to English.