AITopics | backtranslation

1763ea5a7e72dd7ee64073c2dda7a7a8-AuthorFeedback.pdf

Neural Information Processing Systems, Feb-7-2026, 14:56:51 GMT

machine translation, reviewer, translation, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Exploring Parameter-Efficient Fine-Tuning and Backtranslation for the WMT 25 General Translation Task

Fujita, Felipe, Takada, Hideyuki

arXiv.org Artificial Intelligence, Nov-18-2025

In this paper, we explore the effectiveness of combining fine-tuning and backtranslation on a small Japanese corpus for neural machine translation. Starting from a baseline English→Japanese model (COMET = 0.460), we first apply backtranslation (BT) using synthetic data generated from monolingual Japanese corpora, yielding a modest increase (COMET = 0.468). Next, we fine-tune (FT) the model on a genuine small parallel dataset drawn from diverse Japanese news and literary corpora, achieving a substantial jump to COMET = 0.589 when using Mistral 7B. Finally, we integrate both backtranslation and fine-tuning, first augmenting the small dataset with BT-generated examples and then adapting via FT, which further boosts performance to COMET = 0.597. These results demonstrate that, even with limited training data, the synergistic use of backtranslation and targeted fine-tuning on Japanese corpora can significantly enhance translation quality, outperforming each technique in isolation. This approach offers a lightweight yet powerful strategy for improving low-resource language pairs.
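The augmentation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `backtranslate` and `augment`, and the toy lookup table standing in for a Japanese→English model, are assumptions made purely for illustration. The idea shown is that target-side monolingual sentences are translated back into the source language to form synthetic (source, target) pairs, which are then merged with the genuine parallel data before fine-tuning.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def backtranslate(
    target_monolingual: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Build synthetic (source, target) pairs from target-side text only."""
    pairs = []
    for target in target_monolingual:
        # The reverse model maps target language -> source language.
        synthetic_source = reverse_translate(target)
        if synthetic_source.strip():  # skip sentences the model could not translate
            pairs.append((synthetic_source, target))
    return pairs


def augment(genuine: List[Pair], synthetic: List[Pair]) -> List[Pair]:
    """Merge genuine and synthetic pairs, dropping exact duplicates."""
    seen = set()
    combined = []
    for pair in genuine + synthetic:
        if pair not in seen:
            seen.add(pair)
            combined.append(pair)
    return combined


# Hypothetical stand-in for a Japanese -> English model: a lookup table.
toy_ja_en = {
    "猫が好きです。": "I like cats.",
    "雨が降っています。": "It is raining.",
}
synthetic = backtranslate(toy_ja_en.keys(), lambda s: toy_ja_en.get(s, ""))
training_set = augment([("I like cats.", "猫が好きです。")], synthetic)
```

In a real pipeline the lambda would be replaced by a trained reverse-direction model, and `training_set` would be passed to the fine-tuning step.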

artificial intelligence, machine translation, natural language, (10 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2025.wmt-1.52

2511.12109

Country:

North America > United States > Pennsylvania (0.14)
Asia > Middle East > UAE (0.14)
Asia > Japan > Honshū (0.14)

Genre: Research Report > New Finding (0.49)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Should I Share this Translation? Evaluating Quality Feedback for User Reliance on Machine Translation

Ki, Dayeon, Duh, Kevin, Carpuat, Marine

arXiv.org Artificial Intelligence, Oct-3-2025

As people increasingly use AI systems in work and daily life, feedback mechanisms that help them use AI responsibly are urgently needed, particularly in settings where users are not equipped to assess the quality of AI predictions. We study a realistic Machine Translation (MT) scenario where monolingual users decide whether to share an MT output, first without and then with quality feedback. We compare four types of quality feedback: explicit feedback that directly gives users an assessment of translation quality using (1) error highlights and (2) LLM explanations, and implicit feedback that helps users compare MT inputs and outputs through (3) backtranslation and (4) question-answer (QA) tables. We find that all feedback types, except error highlights, significantly improve both decision accuracy and appropriate reliance. Notably, implicit feedback, especially QA tables, yields significantly greater gains than explicit feedback in terms of decision accuracy, appropriate reliance, and user perceptions, receiving the highest ratings for helpfulness and trust, and the lowest for mental burden.

artificial intelligence, natural language, participant, (15 more...)

arXiv.org Artificial Intelligence

2505.24683

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Government > Regional Government > North America Government > United States Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

1763ea5a7e72dd7ee64073c2dda7a7a8-AuthorFeedback.pdf

Neural Information Processing Systems, Oct-2-2025, 06:16:35 GMT

artificial intelligence, natural language, translation, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

The Saturation Point of Backtranslation in High Quality Low Resource English Gujarati Machine Translation

Arif, Arwa

arXiv.org Artificial Intelligence, Jun-30-2025

Backtranslation (BT) is widely used in low-resource machine translation (MT) to generate additional synthetic training data from monolingual corpora. While this approach has shown strong improvements for many language pairs, its effectiveness in high-quality, low-resource settings remains unclear. In this work, we explore the effectiveness of backtranslation for English-Gujarati translation using the multilingual pretrained MBART50 model. Our baseline system, trained on a high-quality parallel corpus of approximately 50,000 sentence pairs, achieves a BLEU score of 43.8 on a validation set. We augment this data with carefully filtered backtranslated examples generated from monolingual Gujarati text. Surprisingly, adding this synthetic data does not improve translation performance and, in some cases, slightly reduces it. We evaluate our models using multiple metrics (BLEU, ChrF++, TER, and BLEURT) and analyze possible reasons for this saturation. Our findings suggest that backtranslation may reach a point of diminishing returns in certain low-resource settings, and we discuss implications for future research.
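The abstract mentions "carefully filtered" backtranslated examples without stating the criteria. The sketch below shows one common style of heuristic filtering, given only as an assumption: a minimum length, a length-ratio cap, and exact-duplicate removal. The function name `filter_synthetic`, the thresholds, and the placeholder sentences are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (synthetic source, genuine target)


def filter_synthetic(
    pairs: List[Pair],
    min_tokens: int = 3,      # assumed threshold, not from the paper
    max_ratio: float = 2.0,   # assumed threshold, not from the paper
) -> List[Pair]:
    """Keep pairs that pass simple length and duplicate checks."""
    kept = []
    seen = set()
    for source, target in pairs:
        src_len = len(source.split())
        tgt_len = len(target.split())
        # Drop very short sentences on either side.
        if min(src_len, tgt_len) < min_tokens:
            continue
        # Drop pairs whose lengths differ too much (likely bad translations).
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue
        # Drop exact duplicates.
        if (source, target) in seen:
            continue
        seen.add((source, target))
        kept.append((source, target))
    return kept


# Placeholder tokens stand in for real target-language text.
candidates = [
    ("the weather is nice today", "t1 t2 t3 t4"),
    ("yes", "t1"),
    ("this is a very long sentence with many many words", "t1 t2 t3"),
    ("the weather is nice today", "t1 t2 t3 t4"),
]
filtered = filter_synthetic(candidates)
```

Only the first candidate survives here: the second is too short, the third exceeds the length ratio, and the fourth is a duplicate.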

backtranslation, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2506.21566

Country: Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Pivot Language for Low-Resource Machine Translation

Talwar, Abhimanyu, Laasri, Julien

arXiv.org Artificial Intelligence, May-22-2025

Certain pairs of languages suffer from the lack of a parallel corpus that is large in size and diverse in domain. One way to overcome this is to use a pivot language. In this paper we use Hindi as a pivot language to translate Nepali into English. We describe what makes Hindi a good candidate for the pivot. We discuss ways in which a pivot language can be used, and use two such approaches, the Transfer Method (fully supervised) and Backtranslation (semi-supervised), to translate Nepali into English. Using the former, we are able to achieve a devtest set SacreBLEU score of 14.2, which improves the baseline fully supervised score reported by Guzmán et al. (2019) by 6.6 points. While we are slightly below the semi-supervised baseline score of 15.1, we discuss what may have caused this under-performance, and suggest scope for future work.
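A pivot setup of the cascaded kind can be sketched as the composition of two translators. This assumes the Transfer Method chains a Nepali→Hindi step with a Hindi→English step, which is the usual reading of the term but is not spelled out in the abstract. The helper `pivot_translate` and the placeholder lookup tables are hypothetical stand-ins for trained models.

```python
from typing import Callable

Translator = Callable[[str], str]


def pivot_translate(source_to_pivot: Translator, pivot_to_target: Translator) -> Translator:
    """Compose two translators into a single source -> target translator."""
    def translate(sentence: str) -> str:
        pivot_sentence = source_to_pivot(sentence)   # e.g. Nepali -> Hindi
        return pivot_to_target(pivot_sentence)       # e.g. Hindi -> English
    return translate


# Placeholder tables standing in for real Nepali->Hindi and Hindi->English models.
toy_ne_hi = {"<ne-1>": "<hi-1>"}
toy_hi_en = {"<hi-1>": "hello"}

ne_to_en = pivot_translate(
    lambda s: toy_ne_hi.get(s, ""),
    lambda s: toy_hi_en.get(s, ""),
)
result = ne_to_en("<ne-1>")
```

One consequence of this design is that errors from the first step propagate into the second, so the quality of the source-to-pivot model bounds the quality of the whole cascade.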

artificial intelligence, machine translation, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.14553

Country:

Asia (0.46)
Europe (0.28)

Genre: Research Report (0.42)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

High-Resource Translation: Turning Abundance into Accessibility

Yanampally, Abhiram Reddy

arXiv.org Artificial Intelligence, Apr-9-2025

Yanampally Abhiram Reddy, ABV-IIITM Gwalior, MP, India

Abstract: This paper presents a novel approach to constructing an English-to-Telugu translation model by leveraging transfer learning techniques and addressing the challenges associated with low-resource languages. Utilizing the Bharat Parallel Corpus Collection (BPCC) as the primary dataset, the model incorporates iterative backtranslation to generate synthetic parallel data, effectively augmenting the training dataset and enhancing the model's translation capabilities. The focus of this research extends beyond mere translation accuracy; it encompasses a comprehensive strategy for improving model performance through data augmentation, optimization of training parameters, and the effective utilization of pre-trained models. By adopting these methodologies, we aim to create a more robust translation system that can handle a diverse range of sentence structures and linguistic nuances inherent to both English and Telugu. This research highlights the significance of innovative data handling techniques and the potential of transfer learning in overcoming the limitations posed by sparse datasets in low-resource languages. This research not only contributes to the field of machine translation but also aims to facilitate better communication and understanding between English and Telugu speakers in real-world contexts. Future work will concentrate on further enhancing the model's robustness and expanding its applicability to more complex sentence structures, ultimately ensuring its practical usability across various domains and applications.

Introduction: Machine translation (MT) is a significant subfield of natural language processing (NLP) that focuses on automatically translating text from one language to another.

machine learning, natural language, translation, (21 more...)

arXiv.org Artificial Intelligence

2504.05914

Country:

Asia > India (0.24)
Europe > Finland > Uusimaa > Helsinki (0.04)
Europe > United Kingdom > Scotland (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre:

Research Report (0.70)
Overview (0.48)
Workflow (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Dual-Class Prompt Generation: Enhancing Indonesian Gender-Based Hate Speech Detection through Data Augmentation

Ibrahim, Muhammad Amien, Faisal, Winarto, Tora Sangputra Yopie, Sulistiya, Zefanya Delvin

arXiv.org Artificial Intelligence, Mar-6-2025

Detecting gender-based hate speech in Indonesian social media remains challenging due to limited labeled datasets. While binary hate speech classification has advanced, a more granular category like gender-targeted hate speech is understudied because of class imbalance issues. This paper addresses this gap by comparing three data augmentation techniques for Indonesian gender-based hate speech detection. We evaluate backtranslation, single-class prompt generation (using only hate speech examples), and our proposed dual-class prompt generation (using both hate speech and non-hate speech examples). Experiments show all augmentation methods improve classification performance, with our dual-class approach achieving the best results (88.5% accuracy, 88.1% F1-score using Random Forest). Semantic similarity analysis reveals dual-class prompt generation produces the most novel content, while t-SNE visualizations confirm these samples occupy distinct feature space regions while maintaining class characteristics. Our findings suggest that incorporating examples from both classes helps language models generate more diverse yet representative samples, effectively addressing limited data challenges in specialized hate speech detection.
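The difference between single-class and dual-class prompt generation can be sketched as a difference in which labeled examples are placed in the generation prompt. The template wording, the label names, and the function `build_prompt` below are assumptions for illustration only; the abstract does not give the authors' actual prompts. Placeholder strings are used instead of real examples.

```python
from typing import List, Optional


def build_prompt(
    target_examples: List[str],
    contrast_examples: Optional[List[str]] = None,
    n_new: int = 5,
) -> str:
    """Build an augmentation prompt for the minority (target) class.

    With contrast_examples=None this is a single-class prompt; passing
    examples of the other class makes it a dual-class prompt.
    """
    lines = ["Examples of the TARGET class:"]
    lines += [f"- {text}" for text in target_examples]
    if contrast_examples:
        lines.append("Examples of the OTHER class:")
        lines += [f"- {text}" for text in contrast_examples]
    lines.append(
        f"Write {n_new} new TARGET-class examples that differ from those above."
    )
    return "\n".join(lines)


single = build_prompt(["<target example 1>", "<target example 2>"])
dual = build_prompt(
    ["<target example 1>", "<target example 2>"],
    contrast_examples=["<other example 1>"],
)
```

The resulting prompt string would then be sent to a language model, and the generated texts added to the training set for the classifier.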

dataset, detection, speech detection, (12 more...)

arXiv.org Artificial Intelligence

2503.04279

Country:

Asia > Indonesia > Borneo > Kalimantan > East Kalimantan > Nusantara (0.05)
Asia > Indonesia > Java > Jakarta > Jakarta (0.05)
North America > United States > Hawaii (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization

Chan, Willy, Souliman, Michael, Nordhagen, Jakob, Miranda, Brando, Obbad, Elyas, Fronsdal, Kai, Koyejo, Sanmi

arXiv.org Artificial Intelligence, Feb-18-2025

Autoformalization, the process of transforming informal mathematical language into formal specifications and proofs, remains a difficult task for state-of-the-art (large) language models. Existing works point to competing explanations for the performance gap. To this end, we introduce a novel methodology that leverages back-translation with hand-curated prompts to enhance the mathematical capabilities of language models, particularly addressing the challenge posed by the scarcity of labeled data. Specifically, we evaluate three primary variations of this strategy: (1) on-the-fly (online) backtranslation, (2) distilled (offline) backtranslation with few-shot amplification, and (3) line-by-line proof analysis integrated with proof state information. Each variant is designed to optimize data quality over quantity, focusing on the high fidelity of generated proofs rather than sheer data scale. Our findings provide evidence that employing our proposed approaches to generate synthetic data, which prioritizes quality over volume, improves the autoformalization performance of LLMs as measured by standard benchmarks such as ProofNet. Crucially, our approach outperforms pretrained models using a minimal number of tokens. We also show, through strategic prompting and backtranslation, that our approaches surpass the performance of fine-tuning with extensive multilingual datasets such as MMA on ProofNet with only 1/150th of the tokens. Taken together, our methods show a promising new approach to significantly reduce the resources required to formalize proofs, thereby accelerating AI for math.

dataset, natural language, translation, (16 more...)

arXiv.org Artificial Intelligence

2502.15795

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Switzerland > Basel-City > Basel (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Defending LLMs against Jailbreaking Attacks via Backtranslation

Wang, Yihan, Shi, Zhouxing, Bai, Andrew, Hsieh, Cho-Jui

arXiv.org Artificial Intelligence, Jun-6-2024

Although many large language models (LLMs) have been trained to refuse harmful requests, they are still vulnerable to jailbreaking attacks which rewrite the original prompt to conceal its harmful intent. In this paper, we propose a new method for defending LLMs against jailbreaking attacks by "backtranslation". Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that can lead to the response. The inferred prompt is called the backtranslated prompt, which tends to reveal the actual intent of the original prompt, since it is generated based on the LLM's response and not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. We explain that the proposed defense provides several benefits in terms of effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines in the cases that are hard for them, and that it has little impact on the generation quality for benign input prompts. Our implementation is based on our library for LLM jailbreaking defense algorithms at https://github.com/YihanWang617/llm-jailbreaking-defense, and the code for reproducing our experiments is available at https://github.com/YihanWang617/LLM-Jailbreaking-Defense-Backtranslation.
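The control flow described in this abstract can be sketched in a few lines. This is an illustrative reading, not the authors' library: the function `defend`, the fixed `REFUSAL` string, the early return when the first response is already a refusal, and the toy stand-ins for the target model, the backtranslation model, and the refusal check are all assumptions.

```python
from typing import Callable

REFUSAL = "I'm sorry, but I cannot assist with that request."  # assumed wording


def defend(
    prompt: str,
    target_llm: Callable[[str], str],
    infer_prompt: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> str:
    """Refuse the original prompt if the model refuses its backtranslation."""
    response = target_llm(prompt)
    if is_refusal(response):
        return response  # assumption: an existing refusal is returned as-is
    # Infer a prompt that could have produced the response.
    backtranslated = infer_prompt(response)
    # Run the target model again on the inferred prompt.
    if is_refusal(target_llm(backtranslated)):
        return REFUSAL
    return response


# Toy stand-ins (hypothetical): the target model refuses only when it sees
# the word "forbidden", and a disguised prompt bypasses that check.
def toy_target(prompt: str) -> str:
    if "forbidden" in prompt:
        return "I'm sorry, I cannot help."
    if "secret-code" in prompt:
        return "Sure: details about the forbidden topic"
    return "Paris is the capital of France."


def toy_infer(response: str) -> str:
    if "forbidden topic" in response:
        return "Tell me about the forbidden topic"
    return "What is the capital of France?"


def toy_is_refusal(text: str) -> bool:
    return text.startswith("I'm sorry")


blocked = defend("secret-code: f0rb1dden", toy_target, toy_infer, toy_is_refusal)
allowed = defend("What is the capital of France?", toy_target, toy_infer, toy_is_refusal)
```

In the toy run, the disguised prompt slips past the target model directly, but its backtranslated form is refused, so the original is refused too; the benign prompt passes through unchanged.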

developer mode, target model, vicuna-13b-v1, (15 more...)

arXiv.org Artificial Intelligence

2402.16459

Genre: Research Report (0.82)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.73)
