Goto

Collaborating Authors

 human translation


Data Kernel Perspective Space Performance Guarantees for Synthetic Data from Transformer Models

Browder, Michael, Duh, Kevin, Harris, J. David, Lyzinski, Vince, McNamee, Paul, Park, Youngser, Priebe, Carey E., Viechnicki, Peter

arXiv.org Machine Learning

Scarcity of labeled training data remains the long pole in the tent for building performant language technology and generative AI models. Transformer models -- particularly LLMs -- are increasingly being used to mitigate the data scarcity problem via synthetic data generation. However, because the models are black boxes, the properties of the synthetic data are difficult to predict. In practice it is common for language technology engineers to 'fiddle' with the LLM temperature setting and hope that what comes out the other end improves the downstream model. Faced with this uncertainty, here we propose Data Kernel Perspective Space (DKPS) to provide the foundation for mathematical analysis yielding concrete statistical guarantees for the quality of the outputs of transformer models. We first show the mathematical derivation of DKPS and how it provides performance guarantees. Next we show how DKPS performance guarantees can elucidate performance of a downstream task, such as neural machine translation models or LLMs trained using Contrastive Preference Optimization (CPO). Limitations of the current work and future research are also discussed.


Missing the human touch? A computational stylometry analysis of GPT-4 translations of online Chinese literature

Yao, Xiaofang, Kang, Yong-Bin, McCosker, Anthony

arXiv.org Artificial Intelligence

Existing research suggests that machine translations of literary texts remain unsatisfactory. Such quality assessment often relies on automated metrics and subjective human ratings, with little attention to the stylistic features of machine translation. Empirical evidence is also scant on whether the advent of AI will transform the literary translation landscape, with implications for other critical domains for translation such as creative industries more broadly. This pioneering study investigates the stylistic features of AI translations, specifically examining GPT -4's performance against human translations in a Chinese online literature task. Our computational stylometry analysis reveals that GPT -4 translations closely mirror human translations in lexical, syntactic and content features. As such, AI translations can in fact replicate the'human touch' in literary translation style. The study provides critical insights into the implications of AI on literary translation in the posthuman, where the line between machine and human translations may become increasingly blurry.


LiTransProQA: an LLM-based Literary Translation evaluation metric with Professional Question Answering

Zhang, Ran, Zhao, Wei, Macken, Lieve, Eger, Steffen

arXiv.org Artificial Intelligence

The impact of Large Language Models (LLMs) has extended into literary domains. However, existing evaluation metrics for literature prioritize mechanical accuracy over artistic expression and tend to overrate machine translation as being superior to human translation from experienced professionals. In the long run, this bias could result in an irreversible decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce LITRANSPROQA, a novel, reference-free, LLM-based question-answering framework designed for literary translation evaluation. LITRANSPROQA integrates humans in the loop to incorporate insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, LITRANSPROQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation and surpassing the best state-of-the-art metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, LITRANSPROQA reaches an adequacy performance comparable to trained linguistic student evaluators, though it still falls behind experienced professional translators. LITRANSPROQA shows broad applicability to open-source models like LLaMA3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free tool for evaluating literary translations that require local processing due to copyright or ethical considerations.


Contextual effects of sentiment deployment in human and machine translation

Comstock, Lindy, Sharma, Priyanshu, Belov, Mikhail

arXiv.org Artificial Intelligence

This paper illustrates how the overall sentiment of a text may be shifted in translation and the implications for automated sentiment analyses, particularly those that utilize machine translation and assess findings via semantic similarity metrics. While human and machine translation will produce more lemmas that fit the expected frequency of sentiment in the target language, only machine translation will also reduce the overall semantic field of the text, particularly in regard to words with epistemic content.


WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects

Deutsch, Daniel, Briakou, Eleftheria, Caswell, Isaac, Finkelstein, Mara, Galor, Rebecca, Juraska, Juraj, Kovacs, Geza, Lui, Alison, Rei, Ricardo, Riesa, Jason, Rijhwani, Shruti, Riley, Parker, Salesky, Elizabeth, Trabelsi, Firas, Winkler, Stephanie, Zhang, Biao, Freitag, Markus

arXiv.org Artificial Intelligence

As large language models (LLM) become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation (MT). In this work, we extend the WMT24 dataset to cover 55 languages by collecting new human-written references and post-edits for 46 new languages and dialects in addition to post-edits of the references in 8 out of 9 languages in the original WMT24 dataset. The dataset covers four domains: literary, news, social, and speech. We benchmark a variety of MT providers and LLMs on the collected dataset using automatic metrics and find that LLMs are the best-performing MT systems in all 55 languages. These results should be confirmed using a human-based evaluation, which we leave for future work.


Characterizing the Effects of Translation on Intertextuality using Multilingual Embedding Spaces

McGovern, Hope, Sirin, Hale, Lippincott, Tom

arXiv.org Artificial Intelligence

Rhetorical devices are difficult to translate, but they are crucial to the translation of literary documents. We investigate the use of multilingual embedding spaces to characterize the preservation of intertextuality, one common rhetorical device, across human and machine translation. To do so, we use Biblical texts, which are both full of intertextual references and are highly translated works. We provide a metric to characterize intertextuality at the corpus level and provide a quantitative analysis of the preservation of this rhetorical device across extant human translations and machine-generated counterparts. We go on to provide qualitative analysis of cases wherein human translations over- or underemphasize the intertextuality present in the text, whereas machine translations provide a neutral baseline. This provides support for established scholarship proposing that human translators have a propensity to amplify certain literary characteristics of the original manuscripts.


A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls

Shafayat, Sheikh, Yoon, Dongkeun, Jang, Woori, Choi, Jiwoo, Oh, Alice, Jung, Seohyon

arXiv.org Artificial Intelligence

In this work, we propose and evaluate the feasibility of a two-stage pipeline to evaluate literary machine translation, in a fine-grained manner, from English to Korean. The results show that our framework provides fine-grained, interpretable metrics suited for literary translation and obtains a higher correlation with human judgment than traditional machine translation metrics. Nonetheless, it still fails to match interhuman agreement, especially in metrics like Korean Honorifics. We also observe that LLMs tend to favor translations generated by other LLMs, and we highlight the necessity of developing more sophisticated evaluation methods to ensure accurate and culturally sensitive machine translation of literary works. Figure 1: The overview of our proposed framework: we evaluate translation of literary works in two stages.


Can ChatGPT capture swearing nuances? Evidence from translating Arabic oaths

Shormani, Mohammed Q.

arXiv.org Artificial Intelligence

This study sets out to answer one major question: Can ChatGPT capture swearing nuances? It presents an empirical study on the ability of ChatGPT to translate Arabic oath expressions into English. 30 Arabic oath expressions were collected from the literature. These 30 oaths were first translated via ChatGPT and then analyzed and compared to the human translation in terms of types of gaps left unfulfilled by ChatGPT. Specifically, the gaps involved are: religious gap, cultural gap, both religious and cultural gaps, no gap, using non-oath particles, redundancy and noncapturing of Arabic script diacritics. It concludes that ChatGPT translation of oaths is still much unsatisfactory, unveiling the need of further developments of ChatGPT, and the inclusion of Arabic data on which ChatGPT should be trained including oath expressions, oath nuances, rituals, and practices.


Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels

Yan, Jianhao, Yan, Pingchuan, Chen, Yulong, Li, Jing, Zhu, Xianchao, Zhang, Yue

arXiv.org Artificial Intelligence

This study presents a comprehensive evaluation of GPT-4's translation capabilities compared to human translators of varying expertise levels. Through systematic human evaluation using the MQM schema, we assess translations across three language pairs (Chinese$\longleftrightarrow$English, Russian$\longleftrightarrow$English, and Chinese$\longleftrightarrow$Hindi) and three domains (News, Technology, and Biomedical). Our findings reveal that GPT-4 achieves performance comparable to junior-level translators in terms of total errors, while still lagging behind senior translators. Unlike traditional Neural Machine Translation systems, which show significant performance degradation in resource-poor language directions, GPT-4 maintains consistent translation quality across all evaluated language pairs. Through qualitative analysis, we identify distinctive patterns in translation approaches: GPT-4 tends toward overly literal translations and exhibits lexical inconsistency, while human translators sometimes over-interpret context and introduce hallucinations. This study represents the first systematic comparison between LLM and human translators across different proficiency levels, providing valuable insights into the current capabilities and limitations of LLM-based translation systems.


How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs

Zhang, Ran, Zhao, Wei, Eger, Steffen

arXiv.org Artificial Intelligence

Recent research has focused on literary machine translation (MT) as a new challenge in MT. However, the evaluation of literary MT remains an open problem. We contribute to this ongoing discussion by introducing LITEVAL-CORPUS, a paragraph-level parallel corpus comprising multiple verified human translations and outputs from 9 MT systems, which totals over 2k paragraphs and includes 13k annotated sentences across four language pairs, costing 4.5k Euro. This corpus enables us to (i) examine the consistency and adequacy of multiple annotation schemes, (ii) compare evaluations by students and professionals, and (iii) assess the effectiveness of LLM-based metrics. We find that Multidimensional Quality Metrics (MQM), as the de facto standard in non-literary human MT evaluation, is inadequate for literary translation: While Best-Worst Scaling (BWS) with students and Scalar Quality Metric (SQM) with professional translators prefer human translations at rates of ~82% and ~94%, respectively, MQM with student annotators prefers human professional translations over the translations of the best-performing LLMs in only ~42% of cases. While automatic metrics generally show a moderate correlation with human MQM and SQM, they struggle to accurately identify human translations, with rates of at most ~20%. Our overall evaluation indicates that human professional translations consistently outperform LLM translations, where even the most recent LLMs tend to produce more literal and less diverse translations compared to human translations. However, newer LLMs such as GPT-4o perform substantially better than older ones.