Machine Translation
Language Models: A Guide for the Perplexed
Serrano, Sofia, Brumbaugh, Zander, Smith, Noah A.
Given the growing importance of AI literacy, we decided to write this tutorial to help narrow the gap between the discourse among those who study language models -- the core technology underlying ChatGPT and similar products -- and those who are intrigued and want to learn more about them. In short, we believe the perspective of researchers and educators can add some clarity to the public's understanding of the technologies beyond what's currently available, which tends to be either extremely technical or promotional material generated about products by their purveyors. Our approach teases apart the concept of a language model from products built on them, from the behaviors attributed to or desired from those products, and from claims about similarity to human cognition. As a starting point, we (1) offer a scientific viewpoint that focuses on questions amenable to study through experimentation; (2) situate language models as they are today in the context of the research that led to their development; and (3) describe the boundaries of what is known about the models at this writing.
A Benchmark for Evaluating Machine Translation Metrics on Dialects Without Standard Orthography
Aepli, Noëmi, Amrhein, Chantal, Schottmann, Florian, Sennrich, Rico
For sensible progress in natural language processing, it is important that we are aware of the limitations of the evaluation metrics we use. In this work, we evaluate how robust metrics are to non-standardized dialects, i.e. spelling differences in language varieties that do not have a standard orthography. To investigate this, we collect a dataset of human translations and human judgments for automatic machine translations from English to two Swiss German dialects. We further create a challenge set for dialect variation and benchmark existing metrics' performances. Our results show that existing metrics cannot reliably evaluate Swiss German text generation outputs, especially on segment level. We propose initial design adaptations that increase robustness in the face of non-standardized dialects, although there remains much room for further improvement. The dataset, code, and models are available here: https://github.com/textshuttle/dialect_eval
Evaluating Optimal Reference Translations
Zouhar, Vilém, Kloudová, Věra, Popel, Martin, Bojar, Ondřej
Machine translation (MT) is routinely evaluated using various segment-level similarity metrics against one or more reference translations. At the same time, reference translations acquired in the standard way are often criticized for their flaws of various types. For several high-resourced language pairs, MT quality reaches levels comparable to the quality of the reference translation (Freitag et al. 2022; Hassan et al. 2018) and sometimes MT even significantly surpasses humans in a particular evaluation setting (Popel et al. 2020). Given this, one could conclude that state-of-the-art MT has reached the point where reference-based evaluation is no longer reliable and we have to resort to other methods (such as targeted expert evaluation of particular outputs), even if they are costly, subjective and possibly impossible to automate. The narrow goal of the presented work is to allow for an "extension of the expiry date" for reference-based evaluation methods. In a broader perspective, we want to formulate a methodology for creating reference translations which avoid the often-observed deficiencies of "standard" or "professional" reference translations, be it multiple interfering phenomena, inappropriate expressions, ignorance of topic-focus articulation (information structure) or other abundant shortcomings in the translation, indicating their authors' insensitivity to the topic itself, but above all to the source and target language. To this end, we introduce so-called optimal reference translations (ORT), which are intended to represent optimal (ideal or excellent) human translations (should they be the subject of a translation quality evaluation).
The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation
Chappuis, Christel, Walt, Eliot, Mendez, Vincent, Lobry, Sylvain, Saux, Bertrand Le, Tuia, Devis
Remote sensing visual question answering (RSVQA) opens new opportunities for the use of overhead imagery by the general public, by enabling human-machine interaction with natural language. Building on the recent advances in natural language processing and computer vision, the goal of RSVQA is to answer a question formulated in natural language about a remote sensing image. Language understanding is essential to the success of the task, but has not yet been thoroughly examined in RSVQA. In particular, the problem of language biases is often overlooked in the remote sensing community, which can impact model robustness and lead to wrong conclusions about the performances of the model. Thus, the present work aims at highlighting the problem of language biases in RSVQA with a threefold analysis strategy: visual blind models, adversarial testing and dataset analysis. This analysis focuses both on model and data. Moreover, we motivate the use of more informative and complementary evaluation metrics sensitive to the issue. The gravity of language biases in RSVQA is then exposed for all of these methods with the training of models discarding the image data and the manipulation of the visual input during inference. Finally, a detailed analysis of question-answer distribution demonstrates the root of the problem in the data itself. Thanks to this analytical study, we observed that biases in remote sensing are more severe than in standard VQA, likely due to the specifics of existing remote sensing datasets for the task, e.g. geographical similarities and sparsity, as well as a simpler vocabulary and question generation strategies. While new, improved and less-biased datasets appear as a necessity for the development of the promising field of RSVQA, we demonstrate that more informed, relative evaluation metrics remain much needed to transparently communicate results of future RSVQA methods.
Graphmax for Text Generation
Liu, Bin (a:1:{s:5:"en_US";s:48:"Southwestern University of Finance and Economics";}) | Yin, Guosheng
In text generation, a large language model (LM) makes a choice of each new word based only on the former selection of its context using the softmax function. Nevertheless, the link statistics information of concurrent words based on a scene-specific corpus is valuable in choosing the next word, which can help to ensure the topic of the generated text to be aligned with the current task. To fully explore the co-occurrence information, we propose a graphmax function for task-specific text generation. Using the graph-based regularization, graphmax enables the final word choice to be determined by both the global knowledge from the LM and the local knowledge from the scene-specific corpus. The traditional softmax function is regularized with a graph total variation (GTV) term, which incorporates the local knowledge into the LM and encourages the model to consider the statistical relationships between words in a scene-specific corpus. The proposed graphmax is versatile and can be readily plugged into any large pre-trained LM for text generation and machine translation. Through extensive experiments, we demonstrate that the new GTV-based regularization can improve performances in various natural language processing (NLP) tasks in comparison with existing methods. Moreover, through human experiments, we observe that participants can easily distinguish the text generated by graphmax or softmax.
Content-Localization based System for Analyzing Sentiment and Hate Behaviors in Low-Resource Dialectal Arabic: English to Levantine and Gulf
Alzamzami, Fatimah, Saddik, Abdulmotaleb El
Even though online social movements can quickly become viral on social media, languages can be a barrier to timely monitoring and analyzing the underlying online social behaviors (OSB). This is especially true for under-resourced languages on social media like dialectal Arabic; the primary language used by Arabs on social media. Therefore, it is crucial to provide solutions to efficiently exploit resources from high-resourced languages to solve language-dependent OSB analysis in under-resourced languages. This paper proposes to localize content of resources in high-resourced languages into under-resourced Arabic dialects. Content localization goes beyond content translation that converts text from one language to another; content localization adapts culture, language nuances and regional preferences from one language to a specific language/dialect. Automating understanding of the natural and familiar day-to-day expressions in different regions, is the key to achieve a wider analysis of OSB especially for smart cities. In this paper, we utilize content-localization based neural machine translation to develop sentiment and hate classifiers for two low-resourced Arabic dialects: Levantine and Gulf. Not only this but we also leverage unsupervised learning to facilitate the analysis of sentiment and hate predictions by inferring hidden topics from the corresponding data and providing coherent interpretations of those topics in their native language/dialects. The experimental evaluations and proof-of-concept COVID-19 case study on real data have validated the effectiveness of our proposed system in precisely distinguishing sentiments and accurately identifying hate content in both Levantine and Gulf Arabic dialects. Our findings shed light on the importance of considering the unique nature of dialects within the same language and ignoring the dialectal aspect would lead to misleading analysis.
Reducing Gender Bias in Machine Translation through Counterfactual Data Generation
Naik, Ranjita, Rarrick, Spencer, Chowdhary, Vishal
Recent advances in neural methods have led to substantial improvement in the quality of Neural Machine Translation (NMT) systems. However, these systems frequently produce translations with inaccurate gender (Stanovsky et al., 2019), which can be traced to bias in training data. Saunders and Byrne (2020) tackle this problem with a handcrafted dataset containing balanced gendered profession words. By using this data to fine-tune an existing NMT model, they show that gender bias can be significantly mitigated, albeit at the expense of translation quality due to catastrophic forgetting. They recover some of the lost quality with modified training objectives or additional models at inference. We find, however, that simply supplementing the handcrafted dataset with a random sample from the base model training corpus is enough to significantly reduce the catastrophic forgetting. We also propose a novel domain-adaptation technique that leverages in-domain data created with the counterfactual data generation techniques proposed by Zmigrod et al. (2019) to further improve accuracy on the WinoMT challenge test set without significant loss in translation quality. We show its effectiveness in NMT systems from English into three morphologically rich languages French, Spanish, and Italian. The relevant dataset and code will be available at Github.
Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs
Conia, Simone, Li, Min, Lee, Daniel, Minhas, Umar Farooq, Ilyas, Ihab, Li, Yunyao
Recent work in Natural Language Processing and Computer Vision has been using textual information -- e.g., entity names and descriptions -- available in knowledge graphs to ground neural models to high-quality structured data. However, when it comes to non-English languages, the quantity and quality of textual information are comparatively scarce. To address this issue, we introduce the novel task of automatic Knowledge Graph Enhancement (KGE) and perform a thorough investigation on bridging the gap in both the quantity and quality of textual information between English and non-English languages. More specifically, we: i) bring to light the problem of increasing multilingual coverage and precision of entity names and descriptions in Wikidata; ii) demonstrate that state-of-the-art methods, namely, Machine Translation (MT), Web Search (WS), and Large Language Models (LLMs), struggle with this task; iii) present M-NTA, a novel unsupervised approach that combines MT, WS, and LLMs to generate high-quality textual information; and, iv) study the impact of increasing multilingual coverage and precision of non-English textual information in Entity Linking, Knowledge Graph Completion, and Question Answering. As part of our effort towards better multilingual knowledge graphs, we also introduce WikiKGE-10, the first human-curated benchmark to evaluate KGE approaches in 10 languages across 7 language families.
Average Token Delay: A Duration-aware Latency Metric for Simultaneous Translation
Kano, Yasumasa, Sudoh, Katsuhito, Nakamura, Satoshi
Simultaneous translation is a task in which the translation begins before the end of an input speech segment. Its evaluation should be conducted based on latency in addition to quality, and for users, the smallest possible amount of latency is preferable. Most existing metrics measure latency based on the start timings of partial translations and ignore their duration. This means such metrics do not penalize the latency caused by long translation output, which delays the comprehension of users and subsequent translations. In this work, we propose a novel latency evaluation metric for simultaneous translation called \emph{Average Token Delay} (ATD) that focuses on the duration of partial translations. We demonstrate its effectiveness through analyses simulating user-side latency based on Ear-Voice Span (EVS). In our experiment, ATD had the highest correlation with EVS among baseline latency metrics under most conditions.
Improving Word Sense Disambiguation in Neural Machine Translation with Salient Document Context
Rippeth, Elijah, Carpuat, Marine, Duh, Kevin, Post, Matt
Lexical ambiguity is a challenging and pervasive problem in machine translation (\mt). We introduce a simple and scalable approach to resolve translation ambiguity by incorporating a small amount of extra-sentential context in neural \mt. Our approach requires no sense annotation and no change to standard model architectures. Since actual document context is not available for the vast majority of \mt training data, we collect related sentences for each input to construct pseudo-documents. Salient words from pseudo-documents are then encoded as a prefix to each source sentence to condition the generation of the translation. To evaluate, we release \docmucow, a challenge set for translation disambiguation based on the English-German \mucow \cite{raganato-etal-2020-evaluation} augmented with document IDs. Extensive experiments show that our method translates ambiguous source words better than strong sentence-level baselines and comparable document-level baselines while reducing training costs.