Machine Translation
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview
Starting from the 1950s, Machine Translation (MT) was challenged by different scientific solutions, which included rule-based methods, example-based and statistical models (SMT), to hybrid models, and very recent years the neural models (NMT). While NMT has achieved a huge quality improvement in comparison to conventional methodologies, by taking advantage of a huge amount of parallel corpora available from the internet and the recently developed super computational power support with an acceptable cost, it struggles to achieve real human parity in many domains and most language pairs, if not all of them. Alongside the long road of MT research and development, quality evaluation metrics played very important roles in MT advancement and evolution. In this tutorial, we overview the traditional human judgement criteria, automatic evaluation metrics, unsupervised quality estimation models, as well as the meta-evaluation of the evaluation methods. Among these, we will also cover the very recent work in the MT evaluation (MTE) fields, taking advantage of the large size of pre-trained language models for automatic metric customisation towards exactly deployed language pairs and domains. In addition, we also introduce the statistical confidence estimation regarding the sample size needed for human evaluation in real practice simulation. Full tutorial material is \textbf{available} to download at https://github.com/poethan/LREC22_MetaEval_Tutorial.
Cross-lingual Opinions and Emotions Mining in Comparable Documents
Saad, Motaz, Langlois, David, Smaili, Kamel
Comparable texts are topic-aligned documents in multiple languages that are not direct translations. They are valuable for understanding how a topic is discussed across languages. This research studies differences in sentiments and emotions across English-Arabic comparable documents. First, texts are annotated with sentiment and emotion labels. We apply a cross-lingual method to label documents with opinion classes (subjective/objective), avoiding reliance on machine translation. To annotate with emotions (anger, disgust, fear, joy, sadness, surprise), we manually translate the English WordNet-Affect (WNA) lexicon into Arabic, creating bilingual emotion lexicons used to label the comparable corpora. We then apply a statistical measure to assess the agreement of sentiments and emotions in each source-target document pair. This comparison is especially relevant when the documents originate from different sources. To our knowledge, this aspect has not been explored in prior literature. Our study includes English-Arabic document pairs from Euronews, BBC, and Al-Jazeera (JSC). Results show that sentiment and emotion annotations align when articles come from the same news agency and diverge when they come from different ones. The proposed method is language-independent and generalizable to other language pairs.
RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation
Li, Tianjiao, Yu, Mengran, Shi, Chenyu, Zhao, Yanjun, Liu, Xiaojing, Zhang, Qiang, Zhang, Qi, Huang, Xuanjing, Wang, Jiayin
Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates the both models, with the RM trained to distinguish strong from weak translations (qualitative preference reward), and the LLM trained to enhance its translation for closing this gap. To stabilize training and improve generalizability, we also incorporate quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Through extensive experiments, we demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.
ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations
This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/) 2 R. Al-Sabbagh / Data in Brief 54 (2024) 1 10271 Subject Computer Science, Social Sciences Specific subject area Natural Language Processing, machine translation, large-language models, translation studies, cross-linguistic analysis, lexical semantics Data format Translated and aligned Type of data Texts (Bilingual tables in Microsoft Excel files) Data collection The ArzEn-MultiGenre dataset consists of three genres: song lyrics, novels, and subtitles. The data was gathered from various sources using different methods. A website was crawled for song lyrics using an in-house web crawler, and professional translators manually translated the lyrics into English. For novels, hard copies were collected in English and Egyptian Arabic, then scanned and converted into text files using an Optical Character Recognizer (OCR). The OCR output was then manually reviewed and aligned.
Building and Aligning Comparable Corpora
Saad, Motaz, Langlois, David, Smaili, Kamel
Comparable corpus is a set of topic aligned documents in multiple languages, which are not necessarily translations of each other. These documents are useful for multilingual natural language processing when there is no parallel text available in some domains or languages. In addition, comparable documents are informative because they can tell what is being said about a topic in different languages. In this paper, we present a method to build comparable corpora from Wikipedia encyclopedia and EURONEWS website in English, French and Arabic languages. We further experiment a method to automatically align comparable documents using cross-lingual similarity measures. We investigate two cross-lingual similarity measures to align comparable documents. The first measure is based on bilingual dictionary, and the second measure is based on Latent Semantic Indexing (LSI). Experiments on several corpora show that the Cross-Lingual LSI (CL-LSI) measure outperforms the dictionary based measure. Finally, we collect English and Arabic news documents from the British Broadcast Corporation (BBC) and from ALJAZEERA (JSC) news website respectively. Then we use the CL-LSI similarity measure to automatically align comparable documents of BBC and JSC. The evaluation of the alignment shows that CL-LSI is not only able to align cross-lingual documents at the topic level, but also it is able to do this at the event level.
A French Version of the OLDI Seed Corpus
Marmonier, Malik, Sagot, Benoรฎt, Bawden, Rachel
We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task. We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers. We also highlight the unique translation challenges presented by the source data, which combines highly technical, encyclopedic terminology with the stylistic irregularities characteristic of user-generated content taken from Wikipedia. This French corpus is not an end in itself, but is intended as a crucial pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France.
SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System
Sibaee, Serry, Nacar, Omer, Al-Habashi, Yasser, Ammar, Adel, Boulila, Wadii
The rich linguistic landscape of the Arab world is characterized by a significant gap between Modern Standard Arabic (MSA), the language of formal communication, and the diverse regional dialects used in everyday life. This diglossia presents a formidable challenge for natural language processing, particularly machine translation. This paper introduces \textbf{SHAMI-MT}, a bidirectional machine translation system specifically engineered to bridge the communication gap between MSA and the Syrian dialect. We present two specialized models, one for MSA-to-Shami and another for Shami-to-MSA translation, both built upon the state-of-the-art AraT5v2-base-1024 architecture. The models were fine-tuned on the comprehensive Nabra dataset and rigorously evaluated on unseen data from the MADAR corpus. Our MSA-to-Shami model achieved an outstanding average quality score of \textbf{4.01 out of 5.0} when judged by OPENAI model GPT-4.1, demonstrating its ability to produce translations that are not only accurate but also dialectally authentic. This work provides a crucial, high-fidelity tool for a previously underserved language pair, advancing the field of dialectal Arabic translation and offering significant applications in content localization, cultural heritage, and intercultural communication.
Interference Matrix: Quantifying Cross-Lingual Interference in Transformer Encoders
Alastruey, Belen, Janeiro, Joรฃo Maria, Allauzen, Alexandre, Elbayad, Maha, Barrault, Loรฏc, Costa-jussร , Marta R.
In this paper, we present a comprehensive study of language interference in encoder-only Transformer models across 83 languages. We construct an interference matrix by training and evaluating small BERT-like models on all possible language pairs, providing a large-scale quantification of cross-lingual interference. Our analysis reveals that interference between languages is asymmetrical and that its patterns do not align with traditional linguistic characteristics, such as language family, nor with proxies like embedding similarity, but instead better relate to script. Finally, we demonstrate that the interference matrix effectively predicts performance on downstream tasks, serving as a tool to better design multilingual models to obtain optimal performance.
Semiotic Complexity and Its Epistemological Implications for Modeling Culture
Stine, Zachary K., Deitrick, James E.
The use of computational methods in the study of cultural artifacts--from models like linear regression and artificial neural networks, to how we evaluate and interpret those models--can be usefully understood as a kind of translation work from a complex, cultural medium into a formal, computational medium. Research questions arise in the cultural domain within culturally-embedded minds. When a researcher designs a computational model to aid in answering such a question, they translate from the cultural into the computational in each modeling decision they make. After completing this first translation problem, the researcher then makes use of the model by interpreting it (either directly or in downstream outputs that depend on it), requiring a second translation to be made, now from the computational going back into the cultural, by way of culturally-embedded researchers making sense of them. In these bidirectional translation problems, we as researchers want to ensure that our translations are reasonable, that they can be sufficiently evaluated and understood by others engaged in collective knowledge-building. Yet translation work can vary in the complexity required to interpret and evaluate it. Consider, for example, how evaluating a translation of "hello" into modern Mandarin Chinese is much simpler than evaluating a translation of a text from classical (i.e., literary) Chinese, like the Zhuangzi, into This preprint article is currently under review.
Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning
Doghmash, Salam Thabet, Saad, Motaz
Hate speech identification in social media has become an increasingly important issue in recent years. In this research, we address two problems: 1) to detect hate speech in Arabic text, 2) to clean a given text from hate speech. The meaning of cleaning here is replacing each bad word with stars based on the number of letters for each word. Regarding the first problem, we conduct several experiments using deep learning models and transformers to determine the best model in terms of the F1 score. Regarding second problem, we consider it as a machine translation task, where the input is a sentence containing dirty text and the output is the same sentence with masking the dirty text. The presented methods achieve the best model in hate speech detection with a 92\% Macro F1 score and 95\% accuracy. Regarding the text cleaning experiment, the best result in the hate speech masking model reached 0.3 in BLEU score with 1-gram, which is a good result compared with the state of the art machine translation systems.