AITopics | Machine Translation

Collaborating Authors

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

News Overviews Instructional Materials AI-Alerts Classics

Building a Functional Machine Translation Corpus for Kpelle

Yamoah, Kweku Andoh, Weako, Jackson, Dorley, Emmanuel J.

arXiv.org Artificial IntelligenceMay-27-2025

In this paper, we introduce the first publicly available English-Kpelle dataset for machine translation, comprising over 2000 sentence pairs drawn from everyday communication, religious texts, and educational materials. By fine-tuning Meta's No Language Left Behind(NLLB) model on two versions of the dataset, we achieved BLEU scores of up to 30 in the Kpelle-to-English direction, demonstrating the benefits of data augmentation. Our findings align with NLLB-200 benchmarks on other African languages, underscoring Kpelle's potential for competitive performance despite its low-resource status. Beyond machine translation, this dataset enables broader NLP tasks, including speech recognition and language modelling. We conclude with a roadmap for future dataset expansion, emphasizing orthographic consistency, community-driven validation, and interdisciplinary collaboration to advance inclusive language technology development for Kpelle and other low-resourced Mande languages.

artificial intelligence, machine translation, natural language, (18 more...)

arXiv.org Artificial Intelligence

2505.18905

Country:

North America > United States (1.00)
Africa (1.00)

Genre:

Research Report (0.70)
Instructional Material (0.66)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

TULUN: Transparent and Adaptable Low-resource Machine Translation

Merx, Raphaël, Suominen, Hanna, Hong, Lois, Thieberger, Nick, Cohn, Trevor, Vylomova, Ekaterina

arXiv.org Artificial IntelligenceMay-27-2025

Machine translation (MT) systems that support low-resource languages often struggle on specialized domains. While researchers have proposed various techniques for domain adaptation, these approaches typically require model fine-tuning, making them impractical for non-technical users and small organizations. To address this gap, we propose Tulun, a versatile solution for terminology-aware translation, combining neural MT with large language model (LLM)-based post-editing guided by existing glossaries and translation memories. Our open-source web-based platform enables users to easily create, edit, and leverage terminology resources, fostering a collaborative human-machine translation process that respects and incorporates domain expertise while increasing MT accuracy. Evaluations show effectiveness in both real-world and benchmark scenarios: on medical and disaster relief translation tasks for Tetun and Bislama, our system achieves improvements of 16.90-22.41 ChrF++ points over baseline MT systems. Across six low-resource languages on the FLORES dataset, Tulun outperforms both standalone MT and LLM approaches, achieving an average improvement of 2.8 ChrF points over NLLB-54B.

large language model, natural language, translation, (17 more...)

arXiv.org Artificial Intelligence

2505.18683

Country:

Europe (1.00)
North America > United States (0.68)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

Neural Information Processing SystemsMay-26-2025, 21:43:48 GMT

Non-autoregressive Transformers (NATs) are recently applied in direct speech-to-speech translation systems, which convert speech across different languages without intermediate text data. Although NATs generate high-quality outputs and offer faster inference than autoregressive models, they tend to produce incoherent and repetitive results due to complex data distribution (e.g., acoustic and linguistic variations in speech). In this work, we introduce DiffNorm, a diffusion-based normalization strategy that simplifies data distributions for training NAT models. Additionally, we propose to regularize NATs with classifier-free guidance, improving model robustness and translation quality by randomly dropping out source information during training. Our strategies result in a notable improvement of about 7 ASR-BLEU for English-Spanish (En-Es) translation and 2 ASR-BLEU for English-French (En-Fr) on the CVSS benchmark, while attaining over 14\times speedup for En-Es and 5 \times speedup for En-Fr translations compared to autoregressive baselines.

artificial intelligence, machine translation, natural language, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Low-Resource NMT: A Case Study on the Written and Spoken Languages in Hong Kong

Mak, Hei Yi, Lee, Tan

arXiv.org Artificial IntelligenceMay-26-2025

The majority of inhabitants in Hong Kong are able to read and write in standard Chinese butuse Cantonese as theprimary spoken language in daily life. Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called writte n Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese. The riseof written Cantonese is increasingly evident in thecyber world.The growing interaction between Mandarin speakers and Cantonese sp eak-ers is leading to a clear demand for automatic translation between Chinese and Cantonese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chine se-to-written-Cantonese translation. Given that parallel text data of Chinese and Cantonese are extremely scarce, a major focus of thi s study is on the effort of preparing good amount of training dat a for NMT. In addition to collecting 28K parallel sentences from previous linguistic studies and scattered internet resources, we devise an effective approach to obtaining 72K parallel sentences by automatically extracting pairs of semantically similar senten ces from parallel articles on Chinese Wikipedia and Cantonese Wikip edia. We show that leveraging highly similar sentence pairs minedfrom Wikipedia improves translation performance in all test set s. Our system outperforms Baidu Fanyi's Chinese-to-Cantonese tr ansla-tion on 6 out of 8 test sets in BLEU scores. Translation exampl es reveal that our system is able to capture important linguistic transformations between standard Chinese and spoken Cantonese.

artificial intelligence, cantonese, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.17816

Country: Asia > China > Hong Kong (0.64)

Genre: Research Report (0.40)

Industry: Media (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

Wang, Tianduo, Xu, Lu, Lu, Wei, Cheng, Shanbo

arXiv.org Artificial IntelligenceMay-23-2025

Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30\%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2505.16972

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Comparative analysis of subword tokenization approaches for Indian languages

Das, Sudhansu Bala, Choudhury, Samujjal, Mishra, Tapas Kumar, Patra, Bidyut Kr.

arXiv.org Artificial IntelligenceMay-23-2025

Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. This is a key phase in machine translation (MT) models. Subword tokenization enhances this process by breaking down words into smaller subword units, which is especially beneficial in languages with complicated morphology or a vast vocabulary. It is useful in capturing the intricate structure of words in Indian languages (ILs), such as prefixes, suffixes, and other morphological variations. These languages frequently use agglutinative structures, in which words are formed by the combination of multiple morphemes such as suffixes, prefixes, and stems. As a result, a suitable tokenization strategy must be chosen to address these scenarios. This paper examines how different subword tokenization techniques, such as SentencePiece, Byte Pair Encoding (BPE), and WordPiece Tokenization, affect ILs. The effectiveness of these subword tokenization techniques is investigated in statistical, neural, and multilingual neural machine translation models. All models are examined using standard evaluation metrics, such as the Bilingual Evaluation Understudy (BLEU) score, TER, METEOR, CHRF, RIBES, and COMET. Based on the results, it appears that for the majority of language pairs for the Statistical and Neural MT models, the SentencePiece tokenizer continuously performed better than other tokenizers in terms of BLEU score. However, BPE tokenization outperformed other tokenization techniques in the context of Multilingual Neural Machine Translation model. The results show that, despite using the same tokenizer and dataset for each model, translations from ILs to English surpassed translations from English to ILs.

artificial intelligence, natural language, translation, (15 more...)

arXiv.org Artificial Intelligence

2505.16868

Country: Asia > India (0.28)

Genre:

Research Report (1.00)
Overview (0.87)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Exploring the Relationship Between Diversity and Quality in Ad Text Generation

Aoki, Yoichi, Murakami, Soichiro, Honda, Ukyo, Kato, Akihiko

arXiv.org Artificial IntelligenceMay-23-2025

In natural language generation for advertising, creating diverse and engaging ad texts is crucial for capturing a broad audience and avoiding advertising fatigue. Regardless of the importance of diversity, the impact of the diversity-enhancing methods in ad text generation -- mainly tested on tasks such as summarization and machine translation -- has not been thoroughly explored. Ad text generation significantly differs from these tasks owing to the text style and requirements. This research explores the relationship between diversity and ad quality in ad text generation by considering multiple factors, such as diversity-enhancing methods, their hyperparameters, input-output formats, and the models.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2505.16418

Country:

North America > United States (0.94)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Pivot Language for Low-Resource Machine Translation

Talwar, Abhimanyu, Laasri, Julien

arXiv.org Artificial IntelligenceMay-22-2025

Certain pairs of languages suffer from lack of a parallel corpus which is large in size and diverse in domain. One of the ways this is overcome is via use of a pivot language. In this paper we use Hindi as a pivot language to translate Nepali into English. We describe what makes Hindi a good candidate for the pivot. We discuss ways in which a pivot language can be used, and use two such approaches - the Transfer Method (fully supervised) and Backtransla-tion (semi-supervised) - to translate Nepali into English. Using the former, we are able to achieve a devtest Set SacreBLEU score of 14.2, which improves the baseline fully supervised score reported by (Guzm an et al., 2019) by 6.6 points. While we are slightly below the semi-supervised baseline score of 15.1, we discuss what may have caused this under-performance, and suggest scope for future work.

artificial intelligence, machine translation, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.14553

Country:

Asia (0.46)
Europe (0.28)

Genre: Research Report (0.42)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation

Zhang, Yuhao, Ma, Xiangnan, Kou, Kaiqi, Liu, Peizhuo, Shan, Weiqiao, Wang, Benyou, Xiao, Tong, Huang, Yuxin, Yu, Zhengtao, Zhu, Jingbo

arXiv.org Artificial IntelligenceMay-22-2025

The success of building textless speech-to-speech translation (S2ST) models has attracted much attention. However, S2ST still faces two main challenges: 1) extracting linguistic features for various speech signals, called cross-modal (CM), and 2) learning alignment of difference languages in long sequences, called cross-lingual (CL). We propose the unit language to overcome the two modeling challenges. The unit language can be considered a text-like representation format, constructed using $n$-gram language modeling. We implement multi-task learning to utilize the unit language in guiding the speech modeling process. Our initial results reveal a conflict when applying source and target unit languages simultaneously. We propose task prompt modeling to mitigate this conflict. We conduct experiments on four languages of the Voxpupil dataset. Our method demonstrates significant improvements over a strong baseline and achieves performance comparable to models trained with text.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2505.15333

Country:

Asia > China (0.94)
North America > United States (0.68)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Exploring In-Image Machine Translation with Real-World Background

Tian, Yanzhi, Liu, Zeming, Liu, Zhengyang, Guo, Yuhang

arXiv.org Artificial IntelligenceMay-22-2025

In-Image Machine Translation (IIMT) aims to translate texts within images from one language to another. Previous research on IIMT was primarily conducted on simplified scenarios such as images of one-line text with black font in white backgrounds, which is far from reality and impractical for applications in the real world. To make IIMT research practically valuable, it is essential to consider a complex scenario where the text backgrounds are derived from real-world images. To facilitate research of complex scenario IIMT, we design an IIMT dataset that includes subtitle text with real-world background. However previous IIMT models perform inadequately in complex scenarios. To address the issue, we propose the DebackX model, which separates the background and text-image from the source image, performs translation on text-image directly, and fuses the translated text-image with the background, to generate the target image. Experimental results show that our model achieves improvements in both translation quality and visual effect.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.15282

Country:

North America (0.46)
Asia (0.28)

Genre: Research Report > New Finding (0.49)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Add feedback