AITopics

1605.04515

Country:

North America > United States > Colorado (0.14)
Europe > United Kingdom > England (0.14)

Genre:

Instructional Material (1.00)
Overview (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Saad, Motaz, Langlois, David, Smaili, Kamel

Cross-lingual Opinions and Emotions Mining in Comparable Documents

arXiv.org Artificial Intelligence Aug-6-2025

Comparable texts are topic-aligned documents in multiple languages that are not direct translations. They are valuable for understanding how a topic is discussed across languages. This research studies differences in sentiments and emotions across English-Arabic comparable documents. First, texts are annotated with sentiment and emotion labels. We apply a cross-lingual method to label documents with opinion classes (subjective/objective), avoiding reliance on machine translation. To annotate with emotions (anger, disgust, fear, joy, sadness, surprise), we manually translate the English WordNet-Affect (WNA) lexicon into Arabic, creating bilingual emotion lexicons used to label the comparable corpora. We then apply a statistical measure to assess the agreement of sentiments and emotions in each source-target document pair. This comparison is especially relevant when the documents originate from different sources. To our knowledge, this aspect has not been explored in prior literature. Our study includes English-Arabic document pairs from Euronews, BBC, and Al-Jazeera (JSC). Results show that sentiment and emotion annotations align when articles come from the same news agency and diverge when they come from different ones. The proposed method is language-independent and generalizable to other language pairs.

artificial intelligence, machine learning, natural language, (20 more...)

2508.03112

Country:

Europe (1.00)
North America > United States > New York (0.28)

Genre: Research Report > New Finding (0.48)

Industry:

Media > News (0.67)
Media > Film (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(3 more...)

arXiv.org Artificial Intelligence Aug-6-2025

RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation

Li, Tianjiao, Yu, Mengran, Shi, Chenyu, Zhao, Yanjun, Liu, Xiaojing, Zhang, Qiang, Zhang, Qi, Huang, Xuanjing, Wang, Jiayin

Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates both models, with the RM trained to distinguish strong from weak translations (qualitative preference reward), and the LLM trained to improve its translations so as to close this gap. To stabilize training and improve generalizability, we also incorporate a quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Through extensive experiments, we demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.

large language model, machine learning, translation, (17 more...)

2506.0507

Country:

North America > United States (0.93)
Europe (0.67)

Genre: Research Report (0.82)

Industry:

Government (0.93)
Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations

Al-Sabbagh, Rania

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). R. Al-Sabbagh / Data in Brief 54 (2024) 110271

Subject: Computer Science, Social Sciences
Specific subject area: Natural Language Processing, machine translation, large-language models, translation studies, cross-linguistic analysis, lexical semantics
Data format: Translated and aligned
Type of data: Texts (bilingual tables in Microsoft Excel files)
Data collection: The ArzEn-MultiGenre dataset consists of three genres: song lyrics, novels, and subtitles. The data was gathered from various sources using different methods. A website was crawled for song lyrics using an in-house web crawler, and professional translators manually translated the lyrics into English. For novels, hard copies were collected in English and Egyptian Arabic, then scanned and converted into text files using an Optical Character Recognizer (OCR). The OCR output was then manually reviewed and aligned.

artificial intelligence, machine translation, natural language, (16 more...)

doi: 10.1016/j.dib.2024.110271

2508.01411

Country: Asia > Middle East > UAE (0.14)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (1.00)
Media > Music (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Saad, Motaz, Langlois, David, Smaili, Kamel

Building and Aligning Comparable Corpora

A comparable corpus is a set of topic-aligned documents in multiple languages that are not necessarily translations of each other. These documents are useful for multilingual natural language processing when no parallel text is available in some domains or languages. In addition, comparable documents are informative because they show what is being said about a topic in different languages. In this paper, we present a method to build comparable corpora from the Wikipedia encyclopedia and the EURONEWS website in English, French and Arabic. We further experiment with a method to automatically align comparable documents using cross-lingual similarity measures. We investigate two such measures: the first is based on a bilingual dictionary, and the second is based on Latent Semantic Indexing (LSI). Experiments on several corpora show that the Cross-Lingual LSI (CL-LSI) measure outperforms the dictionary-based measure. Finally, we collect English and Arabic news documents from the British Broadcasting Corporation (BBC) and from the ALJAZEERA (JSC) news website, respectively. We then use the CL-LSI similarity measure to automatically align comparable documents from BBC and JSC. The evaluation of the alignment shows that CL-LSI is able to align cross-lingual documents not only at the topic level but also at the event level.

data mining, machine learning, natural language, (21 more...)

2508.02555

Country:

Europe (1.00)
Asia > Middle East (0.93)
North America > United States (0.93)

Genre: Research Report (1.00)

Industry: Media > News (0.93)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Marmonier, Malik, Sagot, Benoît, Bawden, Rachel

A French Version of the OLDI Seed Corpus

We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task. We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers. We also highlight the unique translation challenges presented by the source data, which combines highly technical, encyclopedic terminology with the stylistic irregularities characteristic of user-generated content taken from Wikipedia. This French corpus is not an end in itself, but is intended as a crucial pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France.

artificial intelligence, natural language, translation, (15 more...)

2508.0229

Country:

Europe > United Kingdom (0.68)
Europe > France (0.67)
North America > United States > Oklahoma (0.28)
North America > Canada > Ontario (0.28)

Genre: Research Report (0.64)

Industry:

Media (0.93)
Education (0.68)
Leisure & Entertainment > Sports > Hockey (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System

Sibaee, Serry, Nacar, Omer, Al-Habashi, Yasser, Ammar, Adel, Boulila, Wadii

The rich linguistic landscape of the Arab world is characterized by a significant gap between Modern Standard Arabic (MSA), the language of formal communication, and the diverse regional dialects used in everyday life. This diglossia presents a formidable challenge for natural language processing, particularly machine translation. This paper introduces SHAMI-MT, a bidirectional machine translation system specifically engineered to bridge the communication gap between MSA and the Syrian dialect. We present two specialized models, one for MSA-to-Shami and another for Shami-to-MSA translation, both built upon the state-of-the-art AraT5v2-base-1024 architecture. The models were fine-tuned on the comprehensive Nabra dataset and rigorously evaluated on unseen data from the MADAR corpus. Our MSA-to-Shami model achieved an outstanding average quality score of 4.01 out of 5.0 when judged by OpenAI's GPT-4.1 model, demonstrating its ability to produce translations that are not only accurate but also dialectally authentic. This work provides a crucial, high-fidelity tool for a previously underserved language pair, advancing the field of dialectal Arabic translation and offering significant applications in content localization, cultural heritage, and intercultural communication.

artificial intelligence, machine learning, natural language, (13 more...)

2508.02268

Country: Asia > Middle East > Syria (0.47)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Alastruey, Belen, Janeiro, João Maria, Allauzen, Alexandre, Elbayad, Maha, Barrault, Loïc, Costa-jussà, Marta R.

Interference Matrix: Quantifying Cross-Lingual Interference in Transformer Encoders

In this paper, we present a comprehensive study of language interference in encoder-only Transformer models across 83 languages. We construct an interference matrix by training and evaluating small BERT-like models on all possible language pairs, providing a large-scale quantification of cross-lingual interference. Our analysis reveals that interference between languages is asymmetrical and that its patterns do not align with traditional linguistic characteristics, such as language family, nor with proxies like embedding similarity, but instead better relate to script. Finally, we demonstrate that the interference matrix effectively predicts performance on downstream tasks, serving as a tool to better design multilingual models to obtain optimal performance.

large language model, machine learning, natural language, (18 more...)

2508.02256

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Stine, Zachary K., Deitrick, James E.

Semiotic Complexity and Its Epistemological Implications for Modeling Culture

arXiv.org Artificial Intelligence Aug-4-2025

The use of computational methods in the study of cultural artifacts--from models like linear regression and artificial neural networks, to how we evaluate and interpret those models--can be usefully understood as a kind of translation work from a complex, cultural medium into a formal, computational medium. Research questions arise in the cultural domain within culturally-embedded minds. When a researcher designs a computational model to aid in answering such a question, they translate from the cultural into the computational in each modeling decision they make. After completing this first translation problem, the researcher then makes use of the model by interpreting it (either directly or in downstream outputs that depend on it), requiring a second translation to be made, now from the computational going back into the cultural, by way of culturally-embedded researchers making sense of them. In these bidirectional translation problems, we as researchers want to ensure that our translations are reasonable, that they can be sufficiently evaluated and understood by others engaged in collective knowledge-building. Yet translation work can vary in the complexity required to interpret and evaluate it. Consider, for example, how evaluating a translation of "hello" into modern Mandarin Chinese is much simpler than evaluating a translation of a text from classical (i.e., literary) Chinese, like the Zhuangzi, into ...

This preprint article is currently under review.

artificial intelligence, machine learning, natural language, (18 more...)

2508.00095

Country:

North America > United States (0.68)
Europe > United Kingdom > England (0.28)

Genre:

Research Report > New Finding (0.50)
Research Report > Experimental Study (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Doghmash, Salam Thabet, Saad, Motaz

Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning

arXiv.org Artificial Intelligence Aug-1-2025

Hate speech identification in social media has become an increasingly important issue in recent years. In this research, we address two problems: 1) detecting hate speech in Arabic text, and 2) cleaning hate speech from a given text. Cleaning here means replacing each bad word with stars, one star per letter of the word. For the first problem, we conduct several experiments using deep learning models and transformers to determine the best model in terms of F1 score. For the second problem, we treat cleaning as a machine translation task, where the input is a sentence containing dirty text and the output is the same sentence with the dirty text masked. The best hate speech detection model achieves a 92% Macro F1 score and 95% accuracy. In the text cleaning experiment, the best hate speech masking model reached a 1-gram BLEU score of 0.3, which is a good result compared with state-of-the-art machine translation systems.
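As a rough illustration of the masking output format described in this abstract (each flagged word replaced by as many stars as it has letters), the sketch below uses a plain lexicon lookup. It is not the authors' approach, which treats masking as a sequence-to-sequence task; the function name `mask_words` and the sample lexicon are hypothetical.

```python
from __future__ import annotations

import re


def mask_words(text: str, lexicon: set[str]) -> str:
    """Replace every word found in `lexicon` with one '*' per character.

    `lexicon` is a hypothetical set of lowercase flagged words; the paper
    learns the masking with a model rather than looking words up.
    """

    def _mask(match: re.Match) -> str:
        word = match.group(0)
        # Compare case-insensitively, but keep the original length.
        return "*" * len(word) if word.lower() in lexicon else word

    # \w+ matches runs of Unicode letters, digits, and underscores.
    return re.sub(r"\w+", _mask, text)


if __name__ == "__main__":
    print(mask_words("you are a fool", {"fool"}))
```

A lookup like this only shows what the target side of a training pair would look like; it cannot handle unseen or obfuscated words, which is one plausible motivation for learning the mapping instead.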

arabic hate speech identification, machine learning, natural language, (19 more...)

2507.23661

Country:

Asia > Middle East > Palestine (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)