Goto

Collaborating Authors

 Machine Translation


A Joint Learning Model with Variational Interaction for Multilingual Program Translation

arXiv.org Artificial Intelligence

Programs implemented in various programming languages form the foundation of software applications. To alleviate the burden of program migration and facilitate the development of software systems, automated program translation across languages has garnered significant attention. Previous approaches primarily focus on pairwise translation paradigms, learning translation between pairs of languages using bilingual parallel data. However, parallel data is difficult to collect for some language pairs, and the distribution of program semantics across languages can shift, posing challenges for pairwise program translation. In this paper, we argue that jointly learning a unified model to translate code across multiple programming languages is superior to separately learning from bilingual parallel data. We propose Variational Interaction for Multilingual Program Translation~(VIM-PT), a disentanglement-based generative approach that jointly trains a unified model for multilingual program translation across multiple languages. VIM-PT disentangles code into language-shared and language-specific features, using variational inference and interaction information with a novel lower bound, then achieves program translation through conditional generation. VIM-PT demonstrates four advantages: 1) captures language-shared information more accurately from various implementations and improves the quality of multilingual program translation, 2) mines and leverages the capability of non-parallel data, 3) addresses the distribution shift of program semantics across languages, 4) and serves as a unified model, reducing deployment complexity.


Sign Language Sense Disambiguation

arXiv.org Artificial Intelligence

This project explores methods to enhance sign language translation of German sign language, specifically focusing on disambiguation of homonyms. Sign language is ambiguous and understudied which is the basis for our experiments. We approach the improvement by training transformer-based models on various bodypart representations to shift the focus on said bodypart. To determine the impact of, e.g., the hand or mouth representations, we experiment with different combinations. The results show that focusing on the mouth increases the performance in small dataset settings while shifting the focus on the hands retrieves better results in larger dataset settings. Our results contribute to better accessibility for non-hearing persons by improving the systems powering digital assistants, enabling a more accurate interaction. The code for this project can be found on GitHub.


Machine Translation with Large Language Models: Decoder Only vs. Encoder-Decoder

arXiv.org Artificial Intelligence

This project, titled "Machine Translation with Large Language Models: Decoder-only vs. Encoder-Decoder," aims to develop a multilingual machine translation (MT) model. Focused on Indian regional languages, especially Telugu, Tamil, and Malayalam, the model seeks to enable accurate and contextually appropriate translations across diverse language pairs. By comparing Decoder-only and Encoder-Decoder architectures, the project aims to optimize translation quality and efficiency, advancing cross-linguistic communication tools.The primary objective is to develop a model capable of delivering high-quality translations that are accurate and contextually appropriate. By leveraging large language models, specifically comparing the effectiveness of Decoder-only and Encoder-Decoder architectures, the project seeks to optimize translation performance and efficiency across multilingual contexts. Through rigorous experimentation and analysis, this project aims to advance the field of machine translation, contributing valuable insights into the effectiveness of different model architectures and paving the way for enhanced cross-linguistic communication tools.


Mitsubishi Electric develops multilingual translation system for meetings

The Japan Times

Mitsubishi Electric said Tuesday that it has developed a prototype for a multilingual system that shows words on a screen in different languages. The Japanese company hopes that the system will be used on occasions such as morning assembly meetings at factories where information needs to be related accurately to a large number of workers, including non-Japanese ones. Mitsubishi Electric aims to put the system into commercial use as early as fiscal 2025, which begins in April next year. The company also expects the system to be used for tourism purposes. The system translates a prepared script written in Japanese into 17 other languages, with the screen showing sentences in four languages, including original Japanese sentences, at once.


Translating Step-by-Step: Decomposing the Translation Process for Improved Translation Quality of Long-Form Texts

arXiv.org Artificial Intelligence

In this paper we present a step-by-step approach to long-form text translation, drawing on established processes in translation studies. Instead of viewing machine translation as a single, monolithic task, we propose a framework that engages language models in a multi-turn interaction, encompassing pre-translation research, drafting, refining, and proofreading, resulting in progressively improved translations. Extensive automatic evaluations using Gemini 1.5 Pro across ten language pairs show that translating step-by-step yields large translation quality improvements over conventional zeroshot prompting approaches and earlier humanlike baseline strategies, resulting in state-of-theart Figure 1: MetricX-23 quality improvements (where results on


Exploring Intrinsic Language-specific Subspaces in Fine-tuning Multilingual Neural Machine Translation

arXiv.org Artificial Intelligence

Multilingual neural machine translation models support fine-tuning hundreds of languages simultaneously. However, fine-tuning on full parameters solely is inefficient potentially leading to negative interactions among languages. In this work, we demonstrate that the fine-tuning for a language occurs in its intrinsic language-specific subspace with a tiny fraction of entire parameters. Thus, we propose language-specific LoRA to isolate intrinsic language-specific subspaces. Furthermore, we propose architecture learning techniques and introduce a gradual pruning schedule during fine-tuning to exhaustively explore the optimal setting and the minimal intrinsic subspaces for each language, resulting in a lightweight yet effective fine-tuning procedure. The experimental results on a 12-language subset and a 30-language subset of FLORES-101 show that our methods not only outperform full-parameter fine-tuning up to 2.25 spBLEU scores but also reduce trainable parameters to $0.4\%$ for high and medium-resource languages and $1.6\%$ for low-resource ones.


Vision-fused Attack: Advancing Aggressive and Stealthy Adversarial Text against Neural Machine Translation

arXiv.org Artificial Intelligence

While neural machine translation (NMT) models achieve success in our daily lives, they show vulnerability to adversarial attacks. Despite being harmful, these attacks also offer benefits for interpreting and enhancing NMT models, thus drawing increased research attention. However, existing studies on adversarial attacks are insufficient in both attacking ability and human imperceptibility due to their sole focus on the scope of language. This paper proposes a novel vision-fused attack (VFA) framework to acquire powerful adversarial text, i.e., more aggressive and stealthy. Regarding the attacking ability, we design the vision-merged solution space enhancement strategy to enlarge the limited semantic solution space, which enables us to search for adversarial candidates with higher attacking ability. For human imperceptibility, we propose the perception-retained adversarial text selection strategy to align the human text-reading mechanism. Thus, the finally selected adversarial text could be more deceptive. Extensive experiments on various models, including large language models (LLMs) like LLaMA and GPT-3.5, strongly support that VFA outperforms the comparisons by large margins (up to 81%/14% improvements on ASR/SSIM).


Enhancing AI-based Generation of Software Exploits with Contextual Information

arXiv.org Artificial Intelligence

This practical experience report explores Neural Machine Translation (NMT) models' capability to generate offensive security code from natural language (NL) descriptions, highlighting the significance of contextual understanding and its impact on model performance. Our study employs a dataset comprising real shellcodes to evaluate the models across various scenarios, including missing information, necessary context, and unnecessary context. The experiments are designed to assess the models' resilience against incomplete descriptions, their proficiency in leveraging context for enhanced accuracy, and their ability to discern irrelevant information. The findings reveal that the introduction of contextual data significantly improves performance. However, the benefits of additional context diminish beyond a certain point, indicating an optimal level of contextual information for model training. Moreover, the models demonstrate an ability to filter out unnecessary context, maintaining high levels of accuracy in the generation of offensive security code. This study paves the way for future research on optimizing context use in AI-driven code generation, particularly for applications requiring a high degree of technical precision such as the generation of offensive code.


Chain-of-Translation Prompting (CoTR): A Novel Prompting Technique for Low Resource Languages

arXiv.org Artificial Intelligence

This paper introduces Chain of Translation Prompting (CoTR), a novel strategy designed to enhance the performance of language models in low-resource languages. CoTR restructures prompts to first translate the input context from a low-resource language into a higher-resource language, such as English. The specified task like generation, classification, or any other NLP function is then performed on the translated text, with the option to translate the output back to the original language if needed. All these steps are specified in a single prompt. We demonstrate the effectiveness of this method through a case study on the low-resource Indic language Marathi. The CoTR strategy is applied to various tasks, including sentiment analysis, hate speech classification, subject classification and text generation, and its efficacy is showcased by comparing it with regular prompting methods. Our results underscore the potential of translation-based prompting strategies to significantly improve multilingual LLM performance in low-resource languages, offering valuable insights for future research and applications. We specifically see the highest accuracy improvements with the hate speech detection task. The technique also has the potential to enhance the quality of synthetic data generation for underrepresented languages using LLMs.


BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

arXiv.org Artificial Intelligence

Language models can largely benefit from efficient tokenization. However, they still mostly utilize the classical BPE algorithm, a simple and reliable method. This has been shown to cause such issues as under-trained tokens and sub-optimal compression that may affect the downstream performance. We introduce Picky BPE, a modified BPE algorithm that carries Figure 1: An example of a series of merges to produce a out vocabulary refinement during tokenizer token Kentucky. The pre-merge token frequencies are training. Our method improves vocabulary efficiency, demonstrated in corresponding circles. In the vanilla eliminates under-trained tokens, and BPE algorithm, entucky should also be stored in the does not compromise text compression. Our vocabulary, whereas it is redundant after the merge. In experiments show that our method does not this example, the IoS metric effectively captures the reduce the downstream performance, and in intermediate token, as IoS(entucky) T = 0.9.