AITopics

2508.0229

Country:

Europe > United Kingdom (0.68)
Europe > France (0.67)
North America > United States > Oklahoma (0.28)
North America > Canada > Ontario (0.28)

Genre: Research Report (0.64)

Industry:

Media (0.93)
Education (0.68)
Leisure & Entertainment > Sports > Hockey (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial Intelligence, Aug-5-2025

SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System

Sibaee, Serry, Nacar, Omer, Al-Habashi, Yasser, Ammar, Adel, Boulila, Wadii

The rich linguistic landscape of the Arab world is characterized by a significant gap between Modern Standard Arabic (MSA), the language of formal communication, and the diverse regional dialects used in everyday life. This diglossia presents a formidable challenge for natural language processing, particularly machine translation. This paper introduces SHAMI-MT, a bidirectional machine translation system specifically engineered to bridge the communication gap between MSA and the Syrian dialect. We present two specialized models, one for MSA-to-Shami and another for Shami-to-MSA translation, both built upon the state-of-the-art AraT5v2-base-1024 architecture. The models were fine-tuned on the comprehensive Nabra dataset and rigorously evaluated on unseen data from the MADAR corpus. Our MSA-to-Shami model achieved an outstanding average quality score of 4.01 out of 5.0 when judged by OpenAI's GPT-4.1 model, demonstrating its ability to produce translations that are not only accurate but also dialectally authentic. This work provides a crucial, high-fidelity tool for a previously underserved language pair, advancing the field of dialectal Arabic translation and offering significant applications in content localization, cultural heritage, and intercultural communication.

artificial intelligence, machine learning, natural language, (13 more...)

2508.02268

Country: Asia > Middle East > Syria (0.47)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Alastruey, Belen, Janeiro, João Maria, Allauzen, Alexandre, Elbayad, Maha, Barrault, Loïc, Costa-jussà, Marta R.

Interference Matrix: Quantifying Cross-Lingual Interference in Transformer Encoders

arXiv.org Artificial Intelligence, Aug-5-2025

In this paper, we present a comprehensive study of language interference in encoder-only Transformer models across 83 languages. We construct an interference matrix by training and evaluating small BERT-like models on all possible language pairs, providing a large-scale quantification of cross-lingual interference. Our analysis reveals that interference between languages is asymmetrical and that its patterns do not align with traditional linguistic characteristics, such as language family, nor with proxies like embedding similarity, but instead better relate to script. Finally, we demonstrate that the interference matrix effectively predicts performance on downstream tasks, serving as a tool to better design multilingual models to obtain optimal performance.

large language model, machine learning, natural language, (18 more...)

2508.02256

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Stine, Zachary K., Deitrick, James E.

Semiotic Complexity and Its Epistemological Implications for Modeling Culture

arXiv.org Artificial Intelligence, Aug-4-2025

The use of computational methods in the study of cultural artifacts--from models like linear regression and artificial neural networks, to how we evaluate and interpret those models--can be usefully understood as a kind of translation work from a complex, cultural medium into a formal, computational medium. Research questions arise in the cultural domain within culturally-embedded minds. When a researcher designs a computational model to aid in answering such a question, they translate from the cultural into the computational in each modeling decision they make. After completing this first translation problem, the researcher then makes use of the model by interpreting it (either directly or in downstream outputs that depend on it), requiring a second translation to be made, now from the computational going back into the cultural, by way of culturally-embedded researchers making sense of them. In these bidirectional translation problems, we as researchers want to ensure that our translations are reasonable, that they can be sufficiently evaluated and understood by others engaged in collective knowledge-building. Yet translation work can vary in the complexity required to interpret and evaluate it. Consider, for example, how evaluating a translation of "hello" into modern Mandarin Chinese is much simpler than evaluating a translation of a text from classical (i.e., literary) Chinese, like the Zhuangzi, into ...

This preprint article is currently under review.

artificial intelligence, machine learning, natural language, (18 more...)

2508.00095

Country:

North America > United States (0.68)
Europe > United Kingdom > England (0.28)

Genre:

Research Report > New Finding (0.50)
Research Report > Experimental Study (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Doghmash, Salam Thabet, Saad, Motaz

Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning

Hate speech identification in social media has become an increasingly important issue in recent years. In this research, we address two problems: 1) detecting hate speech in Arabic text, and 2) cleaning hate speech from a given text. Cleaning here means replacing each bad word with stars according to the number of letters in the word. Regarding the first problem, we conduct several experiments using deep learning models and transformers to determine the best model in terms of the F1 score. Regarding the second problem, we treat it as a machine translation task, where the input is a sentence containing dirty text and the output is the same sentence with the dirty text masked. The best hate speech detection model achieves a 92% macro F1 score and 95% accuracy. In the text cleaning experiment, the best hate speech masking model reached a 1-gram BLEU score of 0.3, which is a good result compared with state-of-the-art machine translation systems.
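
The masking target described in the abstract can be illustrated with a small rule-based sketch. This is not the authors' method (they train a translation model to produce the masked sentence); it only shows the assumed output format of one star per character. The function name `mask_terms` and the `blocked` word set are hypothetical, and English placeholder words are used for readability.

```python
def mask_terms(sentence: str, blocked: set[str]) -> str:
    """Replace each blocked whitespace-delimited token with one star per character.

    Assumption: tokens are split on whitespace and matched exactly;
    a real system would need normalization and morphology handling.
    """
    return " ".join(
        "*" * len(token) if token in blocked else token
        for token in sentence.split()
    )


if __name__ == "__main__":
    # Hypothetical lexicon with a single placeholder word.
    print(mask_terms("you are a fool", {"fool"}))
```

A sentence pair of this form (original sentence, masked sentence) is the kind of source/target example a sequence-to-sequence model could be trained on.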

arabic hate speech identification, machine learning, natural language, (19 more...)

2507.23661

Country:

Asia > Middle East > Palestine (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators

Sandrini, Peter

The rapid proliferation of Large Language Models presents both opportunities and challenges for the translation field. While commercial, cloud-based AI chatbots have garnered significant attention in translation studies, concerns regarding data privacy, security, and equitable access necessitate exploration of alternative deployment models. This paper investigates the feasibility and performance of locally deployable, free language models as a viable alternative to proprietary, cloud-based AI solutions. This study evaluates three open-source models installed on CPU-based platforms and compares them against commercially available online chatbots. The evaluation focuses on functional performance rather than a comparative analysis of human-machine translation quality, an area already subject to extensive research. The platforms assessed were chosen for their accessibility and ease of use across various operating systems. While local deployment introduces its own challenges, the benefits of enhanced data control, improved privacy, and reduced dependency on cloud services are compelling. The findings of this study contribute to a growing body of knowledge concerning the democratization of AI technology and inform future research and development efforts aimed at making LLMs more accessible and practical for a wider range of users, specifically focusing on the needs of individual translators and small businesses.

large language model, machine learning, translation, (22 more...)

2507.23399

Country: Europe > Finland (0.28)

Genre: Research Report > New Finding (0.86)

Industry:

Information Technology > Security & Privacy (1.00)
Information Technology > Services (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Quality Evaluation of COBOL to Java Code Transformation

Froimovich, Shmulik, Gal, Raviv, Ibraheem, Wesam, Ziv, Avi

We present an automated evaluation system for assessing COBOL-to-Java code translation within IBM's watsonx Code Assistant for Z (WCA4Z). The system addresses key challenges in evaluating LLM-based translators, including model opacity and the complexity of translation quality assessment. Our approach combines analytic checkers with LLM-as-a-judge (LaaJ) techniques to deliver scalable, multi-faceted evaluations. The system supports continuous integration workflows, enables large-scale benchmarking, and reduces reliance on manual review. We describe the system architecture, evaluation strategies, and reporting mechanisms that provide actionable insights for developers and project managers, facilitating the evolution of high-quality, modernized codebases.

large language model, programming language, translation, (20 more...)

2507.23356

Country: Asia (0.46)

Genre: Research Report (0.50)

Industry:

Information Technology (0.51)
Education > Educational Technology > Educational Software (0.46)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)

Galiano-Jiménez, Aarón, Pérez-Ortiz, Juan Antonio, Sánchez-Martínez, Felipe, Sánchez-Cartagena, Víctor M.

Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages

This paper explores sequence-level knowledge distillation (KD) of multilingual pre-trained encoder-decoder translation models. We argue that the teacher model's output distribution holds valuable insights for the student, beyond the approximated mode obtained through beam search (the standard decoding method), and present Multi-Hypothesis Distillation (MHD), a sequence-level KD method that generates multiple translations for each source sentence. This provides a larger representation of the teacher model distribution and exposes the student model to a wider range of target-side prefixes. We leverage n-best lists from beam search to guide the student's learning and examine alternative decoding methods to address issues like low variability and the under-representation of infrequent tokens. For low-resource languages, our research shows that while sampling methods may slightly compromise translation quality compared to beam-search-based approaches, they enhance the generated corpora with greater variability and lexical richness. This ultimately improves student model performance and mitigates the gender bias amplification often associated with KD.

artificial intelligence, natural language, translation, (16 more...)

2507.21568

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Michigan (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.90)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.96)

Yetukuri, Jayanth, Khan, Ishita

Intent-Aware Neural Query Reformulation for Behavior-Aligned Product Search

arXiv.org Artificial Intelligence, Jul-31-2025

Understanding and modeling buyer intent is a foundational challenge in optimizing search query reformulation within the dynamic landscape of e-commerce search systems. This work introduces a robust data pipeline designed to mine and analyze large-scale buyer query logs, with a focus on extracting fine-grained intent signals from both explicit interactions and implicit behavioral cues. Leveraging advanced sequence mining techniques and supervised learning models, the pipeline systematically captures patterns indicative of latent purchase intent, enabling the construction of a high-fidelity, intent-rich dataset. The proposed framework facilitates the development of adaptive query rewrite strategies by grounding reformulations in inferred user intent rather than surface-level lexical signals. This alignment between query rewriting and underlying user objectives enhances both retrieval relevance and downstream engagement metrics. Empirical evaluations across multiple product verticals demonstrate measurable gains in precision-oriented relevance metrics, underscoring the efficacy of intent-aware reformulation. Our findings highlight the value of intent-centric modeling in bridging the gap between sparse user inputs and complex product discovery goals, and establish a scalable foundation for future research in user-aligned neural retrieval and ranking systems.

artificial intelligence, machine learning, natural language, (20 more...)

2507.22213

Genre: Research Report (0.69)

Industry:

Information Technology > Services (0.38)
Materials > Metals & Mining (0.34)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.48)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.30)

arXiv.org Artificial Intelligence, Jul-31-2025

Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law

He, Yanjin, Zeng, Qingkai, Jiang, Meng

Tokenization is a fundamental step in natural language processing (NLP) and other sequence modeling domains, where the choice of vocabulary size significantly impacts model performance. Despite its importance, selecting an optimal vocabulary size remains underexplored, typically relying on heuristics or dataset-specific choices. In this work, we propose a principled method for determining the vocabulary size by analyzing token frequency distributions through Zipf's law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness. Extensive experiments across NLP, genomics, and chemistry demonstrate that models consistently achieve peak performance when the token distribution closely adheres to Zipf's law, establishing Zipfian alignment as a robust and generalizable criterion for vocabulary size selection.
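
One way to picture "how closely token distributions follow power-law behavior" is a least-squares line fit in log-log space. The abstract does not state which measure the authors use, so the sketch below is an assumption: it reports the fitted slope and the R² of log frequency against log rank. The name `zipf_fit` is hypothetical.

```python
import math
from collections import Counter


def zipf_fit(tokens):
    """Fit log(frequency) = a + slope * log(rank); return (slope, r_squared).

    Assumption: goodness of fit is measured by R^2 of an ordinary
    least-squares line in log-log space. An ideal Zipfian distribution
    gives slope close to -1 and R^2 close to 1.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct token types")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx

    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        # All types are equally frequent: a flat line, no power-law decay.
        return 0.0, 0.0
    ss_res = sum(
        (y - (mean_y + slope * (x - mean_x))) ** 2 for x, y in zip(xs, ys)
    )
    return slope, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy corpus whose type frequencies are exactly 60 / rank.
    corpus = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15 + ["e"] * 12
    print(zipf_fit(corpus))
```

Under this assumed measure, candidate vocabulary sizes could be compared by tokenizing the same corpus with each one and checking which yields the fit closest to a power law.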

large language model, machine learning, natural language, (22 more...)

2507.22543

Country: North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)