AITopics | Machine Translation

Collaborating Authors

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

News Overviews Instructional Materials AI-Alerts Classics

Bilingual Word Level Language Identification for Omotic Languages

Yigezu, Mesay Gemeda, Bade, Girma Yohannis, Tonja, Atnafu Lambebo, Kolesnikova, Olga, Sidorov, Grigori, Gelbukh, Alexander

arXiv.org Artificial IntelligenceSep-11-2025

Language identification is the task of determining the languages for a given text. In many real-world scenarios, text may contain more than one language, particularly in multilingual communities. Bilingual Language Identification (BLID) is the task of identifying and distinguishing between two languages in a given text. This paper presents BLID for languages spoken in the southern part of Ethiopia, namely Wolaita and Gofa. The presence of words' similarities and differences between the two languages makes the language identification task challenging. To overcome this challenge, we employed various experiments on various approaches. Then, the combination of the Bert-based pre-trained language model and LSTM approach performed better, with an F1-score of 0.72 on the test set. As a result, the work will be effective in tackling unwanted social media issues and providing a foundation for further research in this area.

experiment, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.07998

Country:

North America > Mexico (0.47)
Africa > Ethiopia (0.36)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.95)

Add feedback

Hunyuan-MT Technical Report

Zheng, Mao, Li, Zheng, Qu, Bingxin, Song, Mingyang, Du, Yang, Sun, Mingrui, Wang, Di

arXiv.org Artificial IntelligenceSep-10-2025

In this report, we introduce Hunyuan-MT-7B, our first open-source multilingual translation model, which supports bidirectional translation across 33 major languages and places a special emphasis on translation between Mandarin and several ethnic minority languages as well as dialects. Furthermore, to serve and address diverse translation scenarios and enhance model performance at test time, we introduce Hunyuan-MT-Chimera-7B, a translation model inspired by the slow thinking mode. This model integrates multiple outputs generated by the Hunyuan-MT-7B model under varying parameter settings, thereby achieving performance superior to that of conventional slow-thinking models based on Chain-of-Thought (CoT). The development of our models follows a holistic training process specifically engineered for multilingual translation, which begins with general and MT-oriented pre-training to build foundational capabilities, proceeds to Supervised Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced alignment through Reinforcement Learning (RL) and weak-to-strong RL. Through comprehensive experimentation, we demonstrate that both Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B significantly outperform all translation-specific models of comparable parameter size and most of the SOTA large models, particularly on the task of translation between Mandarin and minority languages as well as dialects. In the WMT2025 shared task (General Machine Translation), our models demonstrate state-of-the-art performance, ranking first in 30 out of 31 language pairs. This result highlights the robustness of our models across a diverse linguistic spectrum, encompassing high-resource languages such as Chinese, English, and Japanese, as well as low-resource languages including Czech, Marathi, Estonian, and Icelandic.

large language model, machine learning, translation, (19 more...)

arXiv.org Artificial Intelligence

2509.05209

Country:

North America > Canada (0.46)
Asia > China (0.28)
North America > United States (0.28)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

No Translation Needed: Forecasting Quality from Fertility and Metadata

Lundin, Jessica M., Zhang, Ada, Adelani, David, Carroll, Cody

arXiv.org Artificial IntelligenceSep-9-2025

We show that translation quality can be predicted with surprising accuracy \textit{without ever running the translation system itself}. Using only a handful of features, token fertility ratios, token counts, and basic linguistic metadata (language family, script, and region), we can forecast ChrF scores for GPT-4o translations across 203 languages in the FLORES-200 benchmark. Gradient boosting models achieve favorable performance ($R^{2}=0.66$ for XX$\rightarrow$English and $R^{2}=0.72$ for English$\rightarrow$XX). Feature importance analyses reveal that typological factors dominate predictions into English, while fertility plays a larger role for translations into diverse target languages. These findings suggest that translation quality is shaped by both token-level fertility and broader linguistic typology, offering new insights for multilingual evaluation and quality estimation.

machine learning, natural language, translation, (19 more...)

arXiv.org Artificial Intelligence

2509.05425

Country: Europe > Middle East > Malta (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Add feedback

Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios

Bulla, Luana, Tuccio, Gabriele, Mongiovì, Misael, Gangemi, Aldo

arXiv.org Artificial IntelligenceSep-9-2025

Translating natural languages into sign languages is a highly complex and underexplored task. Despite growing interest in accessibility and inclusivity, the development of robust translation systems remains hindered by the limited availability of parallel corpora which align natural language with sign language data. Existing methods often struggle to generalize in these data-scarce environments, as the few datasets available are typically domain-specific, lack standardization, or fail to capture the full linguistic richness of sign languages. To address this limitation, we propose Advanced Use of LLMs for Sign Language Translation (AulSign), a novel method that leverages Large Language Models via dynamic prompting and in-context learning with sample selection and subsequent sign association. Despite their impressive abilities in processing text, LLMs lack intrinsic knowledge of sign languages; therefore, they are unable to natively perform this kind of translation. To overcome this limitation, we associate the signs with compact descriptions in natural language and instruct the model to use them. We evaluate our method on both English and Italian languages using SignBank+, a recognized benchmark in the field, as well as the Italian LaCAM CNR-ISTC dataset. We demonstrate superior performance compared to state-of-the-art models in low-data scenario. Our findings demonstrate the effectiveness of AulSign, with the potential to enhance accessibility and inclusivity in communication technologies for underrepresented linguistic communities.

large language model, machine learning, translation, (19 more...)

arXiv.org Artificial Intelligence

2508.18183

Country: Europe (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Promising Solution (0.68)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

PRIM: Towards Practical In-Image Multilingual Machine Translation

Tian, Yanzhi, Liu, Zeming, Liu, Zhengyang, Feng, Chong, Li, Xin, Huang, Heyan, Guo, Yuhang

arXiv.org Artificial IntelligenceSep-8-2025

In-Image Machine Translation (IIMT) aims to translate images containing texts from one language to another. Current research of end-to-end IIMT mainly conducts on synthetic data, with simple background, single font, fixed text position, and bilingual translation, which can not fully reflect real world, causing a significant gap between the research and practical conditions. To facilitate research of IIMT in real-world scenarios, we explore Practical In-Image Multilingual Machine Translation (IIMMT). In order to convince the lack of publicly available data, we annotate the PRIM dataset, which contains real-world captured one-line text images with complex background, various fonts, diverse text positions, and supports multilingual translation directions. We propose an end-to-end model VisTrans to handle the challenge of practical conditions in PRIM, which processes visual text and background information in the image separately, ensuring the capability of multilingual translation while improving the visual quality. Experimental results indicate the VisTrans achieves a better translation quality and visual effect compared to other models. The code and dataset are available at: https://github.com/BITHLP/PRIM.

computational linguistic, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.05146

Country:

North America > United States (0.46)
Europe > Austria (0.30)
North America > Mexico (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation

Guo, Jiaxin, Wei, Daimeng, Luo, Yuanchang, Chen, Xiaoyu, Wu, Zhanglin, Yang, Huan, Shang, Hengchao, Li, Zongyao, Rao, Zhiqiang, Yang, Jinlong, Yang, Hao

arXiv.org Artificial IntelligenceSep-5-2025

Large language models (LLMs) have ushered in a new era for document-level machine translation (\textit{doc}-mt), yet their whole-document outputs challenge existing evaluation methods that assume sentence-by-sentence alignment. We introduce \textit{\textbf{Align-then-Slide}}, a complete evaluation framework for ultra-long doc-mt. In the Align stage, we automatically infer sentence-level source-target correspondences and rebuild the target to match the source sentence number, resolving omissions and many-to-one/one-to-many mappings. In the n-Chunk Sliding Evaluate stage, we calculate averaged metric scores under 1-, 2-, 3- and 4-chunk for multi-granularity assessment. Experiments on the WMT benchmark show a Pearson correlation of 0.929 between our method with expert MQM rankings. On a newly curated real-world test set, our method again aligns closely with human judgments. Furthermore, preference data produced by Align-then-Slide enables effective CPO training and its direct use as a reward model for GRPO, both yielding translations preferred over a vanilla SFT baseline. The results validate our framework as an accurate, robust, and actionable evaluation tool for doc-mt systems.

artificial intelligence, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2509.03809

Country:

North America > Mexico (0.28)
Asia > China (0.28)
Asia > Middle East > UAE (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.88)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)

Add feedback

SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation

Imai, Saki, İnan, Mert, Sicilia, Anthony, Alikhani, Malihe

arXiv.org Artificial IntelligenceSep-5-2025

Evaluating sign language generation is often done through back-translation, where generated signs are first recognized back to text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language-such as facial expressions, spatial grammar, and prosody-but also makes it hard to pinpoint whether evaluation errors come from sign generation model or the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. On PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.

artificial intelligence, machine translation, natural language, (14 more...)

arXiv.org Artificial Intelligence

2509.03791

Country:

Europe (1.00)
North America > United States (0.46)

Genre: Research Report > New Finding (0.68)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Generation (0.82)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.67)

Add feedback

Exploring NLP Benchmarks in an Extremely Low-Resource Setting

Nuha, Ulin, Jatowt, Adam

arXiv.org Artificial IntelligenceSep-5-2025

The effectiveness of Large Language Models (LLMs) diminishes for extremely low-resource languages, such as indigenous languages, primarily due to the lack of labeled data. Despite growing interest, the availability of high-quality natural language processing (NLP) datasets for these languages remains limited, making it difficult to develop robust language technologies. This paper addresses such gap by focusing on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. Leveraging a small set of parallel Ladin-Italian sentence pairs, we create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data. To ensure linguistic quality and reliability, we apply rigorous filtering and back-translation procedures in our method. We further demonstrate that incorporating these synthetic datasets into machine translation training leads to substantial improvements over existing Italian-Ladin translation baselines. Our contributions include the first publicly available sentiment analysis and MCQA datasets for Ladin, establishing foundational resources that can support broader NLP research and downstream applications for this underrepresented language.

large language model, machine learning, translation, (17 more...)

arXiv.org Artificial Intelligence

2509.03962

Country: Europe > Italy (0.29)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation

Alabdullah, Abdullah, Han, Lifeng, Lin, Chenghua

arXiv.org Artificial IntelligenceSep-4-2025

Dialectal Arabic (DA) poses a persistent challenge for natural language processing (NLP), as most everyday communication in the Arab world occurs in dialects that diverge significantly from Modern Standard Arabic (MSA). This linguistic divide impedes progress in Arabic machine translation. This paper presents two core contributions to advancing DA-MSA translation for the Levantine, Egyptian, and Gulf dialects, particularly in low-resource and computationally constrained settings: (i) a comprehensive evaluation of training-free prompting techniques, and (ii) the development of a resource-efficient fine-tuning pipeline. Our evaluation of prompting strategies across six large language models (LLMs) found that few-shot prompting consistently outperformed zero-shot, chain-of-thought, and our proposed Ara-TEaR method. Ara-TEaR is designed as a three-stage self-refinement prompting process, targeting frequent meaning-transfer and adaptation errors in DA-MSA translation. In this evaluation, GPT-4o achieved the highest performance across all prompting settings. For fine-tuning LLMs, a quantized Gemma2-9B model achieved a chrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Joint multi-dialect trained models outperformed single-dialect counterparts by over 10% chrF++, and 4-bit quantization reduced memory usage by 60% with less than 1% performance loss. The results and insights of our experiments offer a practical blueprint for improving dialectal inclusion in Arabic NLP, showing that high-quality DA-MSA machine translation is achievable even with limited resources and paving the way for more inclusive language technologies.

large language model, machine learning, translation, (20 more...)

arXiv.org Artificial Intelligence

2507.20301

Country:

North America > United States (1.00)
Europe (1.00)
Asia > Middle East > UAE (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

How Important is `Perfect' English for Machine Translation Prompts?

Schmidtová, Patrícia, Bafna, Niyati, Aycock, Seth, Vico, Gianluca, Kamzela, Wiktor, Hämmerl, Katharina, Zouhar, Vilém

arXiv.org Artificial IntelligenceSep-3-2025

Large language models (LLMs) have achieved top results in recent machine translation evaluations, but they are also known to be sensitive to errors and perturbations in their prompts. We systematically evaluate how both humanly plausible and synthetic errors in user prompts affect LLMs' performance on two related tasks: Machine translation and machine translation evaluation. We provide both a quantitative analysis and qualitative insights into how the models respond to increasing noise in the user prompt. The prompt quality strongly affects the translation performance: With many errors, even a good prompt can underperform a minimal or poor prompt without errors. However, different noise types impact translation quality differently, with character-level and combined noisers degrading performance more than phrasal perturbations. Qualitative analysis reveals that lower prompt quality largely leads to poorer instruction following, rather than directly affecting translation quality itself. Further, LLMs can still translate in scenarios with overwhelming random noise that would make the prompt illegible to humans.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.09509

Country:

Asia (0.68)
North America > United States (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback