Machine Translation
Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation
Guo, Jiaxin, Wei, Daimeng, Luo, Yuanchang, Chen, Xiaoyu, Wu, Zhanglin, Yang, Huan, Shang, Hengchao, Li, Zongyao, Rao, Zhiqiang, Yang, Jinlong, Yang, Hao
Large language models (LLMs) have ushered in a new era for document-level machine translation (doc-mt), yet their whole-document outputs challenge existing evaluation methods that assume sentence-by-sentence alignment. We introduce Align-then-Slide, a complete evaluation framework for ultra-long doc-mt. In the Align stage, we automatically infer sentence-level source-target correspondences and rebuild the target to match the source sentence count, resolving omissions and many-to-one/one-to-many mappings. In the n-Chunk Sliding Evaluate stage, we calculate averaged metric scores under 1-, 2-, 3- and 4-chunk settings for multi-granularity assessment. Experiments on the WMT benchmark show a Pearson correlation of 0.929 between our method and expert MQM rankings. On a newly curated real-world test set, our method again aligns closely with human judgments. Furthermore, preference data produced by Align-then-Slide enables effective CPO training and its direct use as a reward model for GRPO, both yielding translations preferred over a vanilla SFT baseline. The results validate our framework as an accurate, robust, and actionable evaluation tool for doc-mt systems.
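The n-chunk sliding evaluation described above can be illustrated with a minimal sketch. It assumes the Align stage has already produced one target sentence per source sentence (with an empty string standing in for an omission), that a chunk is a window of n consecutive sentences, and that the final score is the plain mean over chunk sizes. The function names and the toy `token_overlap` metric are hypothetical and are not taken from the paper, which may chunk and aggregate differently.

```python
from typing import Callable, Sequence


def n_chunk_score(
    hyps: Sequence[str],
    refs: Sequence[str],
    metric: Callable[[str, str], float],
    n: int,
) -> float:
    """Average `metric` over every window of n consecutive aligned sentences."""
    if len(hyps) != len(refs):
        raise ValueError("hypothesis and reference must be aligned one-to-one")
    if not 1 <= n <= len(hyps):
        raise ValueError("n must be between 1 and the number of sentences")
    scores = [
        metric(" ".join(hyps[i:i + n]), " ".join(refs[i:i + n]))
        for i in range(len(hyps) - n + 1)
    ]
    return sum(scores) / len(scores)


def sliding_evaluate(hyps, refs, metric, max_n: int = 4) -> float:
    """Mean of the n-chunk scores for n = 1..max_n (capped at document length)."""
    sizes = [n for n in range(1, max_n + 1) if n <= len(hyps)]
    if not sizes:
        raise ValueError("document must contain at least one sentence")
    return sum(n_chunk_score(hyps, refs, metric, n) for n in sizes) / len(sizes)


def token_overlap(hyp: str, ref: str) -> float:
    """Toy stand-in metric (assumption): Jaccard overlap of lowercase tokens."""
    h, r = set(hyp.lower().split()), set(ref.lower().split())
    return len(h & r) / len(h | r) if h | r else 1.0
```

In practice `token_overlap` would be replaced by whatever sentence-level metric is being averaged; larger chunk sizes dilute the penalty of a single omitted sentence across its neighbours, which is one plausible reading of "multi-granularity".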
SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation
Imai, Saki, İnan, Mert, Sicilia, Anthony, Alikhani, Malihe
Evaluating sign language generation is often done through back-translation, where generated signs are first recognized back to text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language (such as facial expressions, spatial grammar, and prosody) but also makes it hard to pinpoint whether evaluation errors come from the sign generation model or the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. On the PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.
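To make the "discrimination between correct and random pairs" figure concrete, the sketch below scores pairs by cosine similarity in a shared embedding space and computes ROC AUC as the probability that a correct pair outscores a random one. It assumes embeddings are plain lists of floats; the function names are hypothetical and the actual SiLVERScore scoring function may differ.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison: P(pos > neg) + 0.5 * P(pos == neg)."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score in each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))
```

An AUC of 0.5 corresponds to a metric that cannot tell matched pairs from random ones, while 1.0 means every matched pair scores higher than every random pair. The pairwise form is quadratic in the number of scores and is meant for illustration only.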
Exploring NLP Benchmarks in an Extremely Low-Resource Setting
The effectiveness of Large Language Models (LLMs) diminishes for extremely low-resource languages, such as indigenous languages, primarily due to the lack of labeled data. Despite growing interest, the availability of high-quality natural language processing (NLP) datasets for these languages remains limited, making it difficult to develop robust language technologies. This paper addresses this gap by focusing on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. Leveraging a small set of parallel Ladin-Italian sentence pairs, we create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data. To ensure linguistic quality and reliability, we apply rigorous filtering and back-translation procedures in our method. We further demonstrate that incorporating these synthetic datasets into machine translation training leads to substantial improvements over existing Italian-Ladin translation baselines. Our contributions include the first publicly available sentiment analysis and MCQA datasets for Ladin, establishing foundational resources that can support broader NLP research and downstream applications for this underrepresented language.
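One common way to combine translation, back-translation, and filtering is a round-trip check, sketched below. This is an assumption about the general technique, not the authors' pipeline: `forward` and `backward` are hypothetical callables standing in for Italian-to-Ladin and Ladin-to-Italian systems, and the similarity measure and `threshold` value are illustrative choices.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple


def round_trip_filter(
    sentences: Iterable[str],
    forward: Callable[[str], str],   # hypothetical: Italian -> Ladin
    backward: Callable[[str], str],  # hypothetical: Ladin -> Italian
    threshold: float = 0.8,          # illustrative cut-off, not from the paper
) -> List[Tuple[str, str, float]]:
    """Keep (source, synthetic, score) triples whose round trip stays close."""
    kept = []
    for src in sentences:
        synthetic = forward(src)
        restored = backward(synthetic)
        # Character-level similarity between the original and the round trip.
        score = SequenceMatcher(None, src.lower(), restored.lower()).ratio()
        if score >= threshold:
            kept.append((src, synthetic, score))
    return kept
```

Sentences whose back-translation drifts far from the original are discarded, on the assumption that the synthetic Ladin side is unreliable for them.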
Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation
Alabdullah, Abdullah, Han, Lifeng, Lin, Chenghua
Dialectal Arabic (DA) poses a persistent challenge for natural language processing (NLP), as most everyday communication in the Arab world occurs in dialects that diverge significantly from Modern Standard Arabic (MSA). This linguistic divide impedes progress in Arabic machine translation. This paper presents two core contributions to advancing DA-MSA translation for the Levantine, Egyptian, and Gulf dialects, particularly in low-resource and computationally constrained settings: (i) a comprehensive evaluation of training-free prompting techniques, and (ii) the development of a resource-efficient fine-tuning pipeline. Our evaluation of prompting strategies across six large language models (LLMs) found that few-shot prompting consistently outperformed zero-shot, chain-of-thought, and our proposed Ara-TEaR method. Ara-TEaR is designed as a three-stage self-refinement prompting process, targeting frequent meaning-transfer and adaptation errors in DA-MSA translation. In this evaluation, GPT-4o achieved the highest performance across all prompting settings. For fine-tuning LLMs, a quantized Gemma2-9B model achieved a chrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Joint multi-dialect trained models outperformed single-dialect counterparts by over 10% chrF++, and 4-bit quantization reduced memory usage by 60% with less than 1% performance loss. The results and insights of our experiments offer a practical blueprint for improving dialectal inclusion in Arabic NLP, showing that high-quality DA-MSA machine translation is achievable even with limited resources and paving the way for more inclusive language technologies.
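The difference between the zero-shot and few-shot settings compared above comes down to how the prompt is assembled. The sketch below is a hypothetical template, not the prompt used in the paper: the instruction wording, the `build_prompt` name, and the placeholder example pairs are all assumptions.

```python
from typing import Iterable, Tuple


def build_prompt(
    source: str,
    dialect: str,
    examples: Iterable[Tuple[str, str]] = (),
) -> str:
    """Assemble a DA-to-MSA prompt; with no examples this is the zero-shot case."""
    lines = [
        f"Translate the following {dialect} Arabic text into Modern Standard Arabic.",
        "",
    ]
    # Each demonstration is a (dialect sentence, MSA translation) pair.
    for dialect_text, msa_text in examples:
        lines += [f"{dialect}: {dialect_text}", f"MSA: {msa_text}", ""]
    # The query is left open so the model completes the final "MSA:" line.
    lines += [f"{dialect}: {source}", "MSA:"]
    return "\n".join(lines)
```

Calling `build_prompt(text, "Egyptian")` gives a zero-shot prompt, while passing a handful of parallel pairs as `examples` gives the few-shot variant without any change to the model.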
How Important is 'Perfect' English for Machine Translation Prompts?
Schmidtová, Patrícia, Bafna, Niyati, Aycock, Seth, Vico, Gianluca, Kamzela, Wiktor, Hämmerl, Katharina, Zouhar, Vilém
Large language models (LLMs) have achieved top results in recent machine translation evaluations, but they are also known to be sensitive to errors and perturbations in their prompts. We systematically evaluate how both humanly plausible and synthetic errors in user prompts affect LLMs' performance on two related tasks: machine translation and machine translation evaluation. We provide both a quantitative analysis and qualitative insights into how the models respond to increasing noise in the user prompt. The prompt quality strongly affects the translation performance: with many errors, even a good prompt can underperform a minimal or poor prompt without errors. However, different noise types impact translation quality differently, with character-level and combined noisers degrading performance more than phrasal perturbations. Qualitative analysis reveals that lower prompt quality largely leads to poorer instruction following, rather than directly affecting translation quality itself. Further, LLMs can still translate in scenarios with overwhelming random noise that would make the prompt illegible to humans.
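A character-level noiser of the kind mentioned above can be sketched as follows. The three operations (drop, substitute, duplicate), the `rate` parameter, and the function name are assumptions chosen for illustration; the paper's noisers are not specified in this abstract and may work differently.

```python
import random
import string


def char_noise(prompt: str, rate: float, seed: int = 0) -> str:
    """Perturb each alphabetic character with probability `rate`."""
    rng = random.Random(seed)  # seeded so the same noise can be reproduced
    out = []
    for ch in prompt:
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(("drop", "substitute", "duplicate"))
            if op == "drop":
                continue  # delete the character
            if op == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
            else:
                out.extend((ch, ch))  # type the character twice
        else:
            out.append(ch)  # spaces, digits, and punctuation pass through
    return "".join(out)
```

Sweeping `rate` from 0 towards 1 gives a series of increasingly noisy prompts that can be fed to the same model to measure how quality changes with noise level.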
The Forgotten Code: Validating a Century-Old Translation System with AI
A pioneering rule-based mechanical translation system (precursor of modern RBMTs) was first presented in December 1929 by its inventor, Federico Pucci, who later published the full method in a book titled "Il traduttore meccanico ed il metodo per corrispondersi fra Europei conoscendo ciascuno solo la propria lingua: Parte I", in Salerno (Italy), in 1931. This study illustrates how AI breathes new life into the system of international keys and ideograms devised by Pucci to translate from/into any Romance language (at least as a first step). The methodology involves having the AIs retranslate, following Pucci's method, the two text excerpts originally translated in 1931 and clearly documented in his publication: a passage from Dante's La Vita Nuova, translated from Italian into French, and a passage from Voltaire's Zadig, translated from French into Italian. The result is notable: the two texts, translated 94 years apart using the same method--by Pucci in 1931 and by AIs in 2025--show a low average difference, with only minor variations observed. With Pucci's system thus validated, it became feasible to have the AIs reproduce the excerpts in English, Spanish, and German according to his method. The results were consistent, and Pucci--via Artificial Intelligence--was tasked with translating more modern and technical texts, thereby reviving, nearly a century later, an invention that had remained almost entirely unknown and never applied beyond its creator, now brought to wider attention and opened to possible experimentation. Such a demonstration would not only affirm Pucci's historical status but also place him among the precursors and intellectual contributors to machine translation, whose work merits examination alongside figures such as Troyanskij, Booth, and Weaver, with possible consequences for how the history of the field is understood.
chDzDT: Word-level morphology-aware language model for Algerian social media text
Pre-trained language models (PLMs) have substantially advanced natural language processing by providing context-sensitive text representations. However, the Algerian dialect remains under-represented, with few dedicated models available. Processing this dialect is challenging due to its complex morphology, frequent code-switching, multiple scripts, and strong lexical influences from other languages. These characteristics complicate tokenization and reduce the effectiveness of conventional word- or subword-level approaches. To address this gap, we introduce chDzDT, a character-level pre-trained language model tailored for Algerian morphology. Unlike conventional PLMs that rely on token sequences, chDzDT is trained on isolated words. This design allows the model to encode morphological patterns robustly, without depending on token boundaries or standardized orthography. The training corpus draws from diverse sources, including YouTube comments, French, English, and Berber Wikipedia, as well as the Tatoeba project. It covers multiple scripts and linguistic varieties, resulting in a substantial pre-training workload. Our contributions are threefold: (i) a detailed morphological analysis of Algerian dialect using YouTube comments; (ii) the construction of a multilingual Algerian lexicon dataset; and (iii) the development and extensive evaluation of a character-level PLM as a morphology-focused encoder for downstream tasks. The proposed approach demonstrates the potential of character-level modeling for morphologically rich, low-resource dialects and lays a foundation for more inclusive and adaptable NLP systems.
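Training on isolated words at the character level means each input is a single word mapped to a fixed-length sequence of character ids, with no subword tokenizer involved. The sketch below shows one way to do this; the special ids, the `max_len` default, and the function names are assumptions and do not describe the actual chDzDT input format.

```python
from typing import Dict, Iterable, List

# Reserved ids (assumed layout): padding, unknown, begin/end of word.
PAD, UNK, BOW, EOW = 0, 1, 2, 3


def build_char_vocab(words: Iterable[str]) -> Dict[str, int]:
    """Assign an id to every character seen, regardless of script."""
    chars = sorted({ch for word in words for ch in word})
    return {ch: i + 4 for i, ch in enumerate(chars)}


def encode_word(word: str, vocab: Dict[str, int], max_len: int = 16) -> List[int]:
    """Encode one isolated word as BOW + character ids + EOW, padded to max_len."""
    body = [vocab.get(ch, UNK) for ch in word][: max_len - 2]
    ids = [BOW] + body + [EOW]
    return ids + [PAD] * (max_len - len(ids))
```

Because the vocabulary is built from characters, Arabic-script, Latin-script, and digit-mixed spellings of the same dialect share one input space, and an unseen spelling still encodes to known character ids rather than falling out of vocabulary as a whole.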
Conditional Generative Adversarial Networks Based Inertial Signal Translation
The paper presents an approach in which inertial signals measured with a wrist-worn sensor (e.g., a smartwatch) are translated into those that would be recorded using a shoe-mounted sensor, enabling the use of state-of-the-art gait analysis methods. In the study, the signals are translated using Conditional Generative Adversarial Networks (GANs). Two GAN variants are used for experimental verification: traditional GANs trained with binary cross-entropy loss and Wasserstein GANs (WGANs). Two generator architectures are tested: a convolutional autoencoder and a convolutional U-Net. The experimental results show that the proposed approach allows for an accurate translation, enabling the use of wrist-sensor inertial signals for efficient, everyday gait analysis.
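The two GAN variants differ mainly in the objective given to the discriminator (or critic). The sketch below writes out the standard textbook forms of both losses on plain Python lists; it assumes the scores were produced by a discriminator that sees the wrist signal as the condition alongside a real or generated shoe signal, and it omits the Lipschitz constraint (weight clipping or gradient penalty) that a Wasserstein critic also needs. The function names are hypothetical.

```python
import math
from typing import Sequence


def bce_discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Binary cross-entropy loss; inputs are probabilities in (0, 1).

    d_real: D(shoe_real | wrist), pushed towards 1.
    d_fake: D(G(wrist) | wrist), pushed towards 0.
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def wasserstein_critic_loss(c_real: Sequence[float], c_fake: Sequence[float]) -> float:
    """Wasserstein critic loss; inputs are unbounded real-valued scores.

    Minimising mean(fake) - mean(real) widens the score gap between real
    and generated signals.
    """
    return sum(c_fake) / len(c_fake) - sum(c_real) / len(c_real)
```

A discriminator that outputs 0.5 for everything yields a BCE loss of 2 ln 2, the value at which it cannot separate real from generated signals.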
CrossTL: A Universal Programming Language Translator with Unified Intermediate Representation
Niketan, Nripesh, Shrivastva, Vaatsalya
We present CrossTL, a universal programming language translator enabling bidirectional translation between multiple languages through a unified intermediate representation called CrossGL. Traditional approaches require separate translators for each language pair, leading to quadratic growth in the number of translators. CrossTL uses a single universal IR to facilitate translations between CUDA, HIP, Metal, DirectX HLSL, OpenGL GLSL, Vulkan SPIR-V, Rust, and Mojo, with Slang support in development. Our system consists of: language-specific lexers/parsers converting source code to ASTs, bidirectional CrossGL translation modules implementing ToCrossGLConverter classes for importing code and CodeGen classes for target generation, and comprehensive backend implementations handling full translation pipelines. We demonstrate effectiveness through comprehensive evaluation across programming domains, achieving successful compilation and execution across all supported backends. The universal IR design enables adding new languages with minimal effort, requiring only language-specific frontend/backend components. Our contributions include: (1) a unified IR capturing semantics of multiple programming paradigms, (2) a modular architecture enabling extensibility, (3) a comprehensive framework supporting GPU compute, graphics programming, and systems languages, and (4) empirical validation demonstrating practical viability of universal code translation. CrossTL represents a significant step toward language-agnostic programming, enabling write-once, deploy-everywhere development.
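The scaling argument for a shared IR can be made concrete by counting components, assuming one translator per ordered language pair in the direct approach and one frontend plus one backend per language in the hub approach. The `Hub` class below is a toy illustration of the hub-and-spoke pattern with a dictionary as the IR; it is hypothetical and does not reflect the actual CrossTL or CrossGL APIs.

```python
from typing import Callable, Dict


def pairwise_translator_count(n_languages: int) -> int:
    """Direct approach: one translator per ordered (source, target) pair."""
    return n_languages * (n_languages - 1)


def hub_component_count(n_languages: int) -> int:
    """Shared-IR approach: one frontend and one backend per language."""
    return 2 * n_languages


class Hub:
    """Toy hub: every language converts to and from one shared IR (a dict)."""

    def __init__(self) -> None:
        self.frontends: Dict[str, Callable[[str], dict]] = {}
        self.backends: Dict[str, Callable[[dict], str]] = {}

    def register(
        self,
        lang: str,
        to_ir: Callable[[str], dict],
        from_ir: Callable[[dict], str],
    ) -> None:
        self.frontends[lang] = to_ir
        self.backends[lang] = from_ir

    def translate(self, code: str, src: str, dst: str) -> str:
        # Source -> IR -> target; no pair-specific translator is needed.
        return self.backends[dst](self.frontends[src](code))
```

For the eight languages listed above, the direct approach would need 56 translators under these assumptions, against 16 frontend and backend components with a shared IR, and adding a ninth language adds two components rather than sixteen translators.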
Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach
Yang, Han, Lan, Jian, Liu, Yihong, Schütze, Hinrich, Seidl, Thomas
Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs while extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, the WMT24 dataset, and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
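The out-of-vocabulary effect behind orthographic attacks can be shown with a small sketch. It assumes one particular kind of attack, replacing Latin letters with visually similar Cyrillic ones, and a toy character vocabulary limited to lowercase ASCII; the mapping table and function names are hypothetical, and the attacks evaluated in the paper may use other alphabets and strategies.

```python
import random
import string

# Latin letter -> look-alike Cyrillic letter (small illustrative subset).
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic a
    "c": "\u0441",  # Cyrillic es
    "e": "\u0435",  # Cyrillic ie
    "o": "\u043e",  # Cyrillic o
    "p": "\u0440",  # Cyrillic er
}


def orthographic_attack(text: str, rate: float, seed: int = 0) -> str:
    """Swap each mappable character for its look-alike with probability `rate`."""
    rng = random.Random(seed)
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in text
    )


def oov_rate(text: str, vocab: set) -> float:
    """Fraction of characters that the given vocabulary does not cover."""
    return sum(ch not in vocab for ch in text) / len(text) if text else 0.0


ASCII_VOCAB = set(string.ascii_lowercase + " ")
```

The attacked string looks nearly identical when rendered, which is the intuition for a pixel-based input, yet every replaced character falls outside the toy vocabulary, so a text-based embedding table has no entry for it.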