AITopics | Machine Translation

Collaborating Authors

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

News Overviews Instructional Materials AI-Alerts Classics

LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages

Coleman, Jared, Krishnamachari, Bhaskar, Iskarous, Khalil, Rosales, Ruben

arXiv.org Artificial IntelligenceMay-16-2024

We propose a new paradigm for machine translation that is particularly useful for no-resource languages (those without any publicly available bilingual or monolingual corpora): LLM-RBMT (LLM-Assisted Rule Based Machine Translation). Using the LLM-RBMT paradigm, we design the first language education/revitalization-oriented machine translator for Owens Valley Paiute (OVP), a critically endangered Indigenous American language for which there is virtually no publicly available data. We present a detailed evaluation of the translator's components: a rule-based sentence builder, an OVP to English translator, and an English to OVP translator. We also discuss the potential of the paradigm, its limitations, and the many avenues for future research that it opens up.

arXiv.org Artificial Intelligence

2405.08997

Country:

North America > United States > California (0.48)
Asia > Middle East > Republic of Türkiye > Batman Province > Batman (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)

Genre:

Overview (0.46)
Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.74)

Add feedback

Word Alignment as Preference for Machine Translation

Wu, Qiyu, Nagata, Masaaki, Miao, Zhongtao, Tsuruoka, Yoshimasa

arXiv.org Artificial IntelligenceMay-15-2024

The problem of hallucination and omission, a long-standing problem in machine translation (MT), is more pronounced when a large language model (LLM) is used in MT because an LLM itself is susceptible to these phenomena. In this work, we mitigate the problem in an LLM-based MT model by guiding it to better word alignment. We first study the correlation between word alignment and the phenomena of hallucination and omission in MT. Then we propose to utilize word alignment as preference to optimize the LLM-based MT model. The preference data are constructed by selecting chosen and rejected translations from multiple MT tools. Subsequently, direct preference optimization is used to optimize the LLM-based model towards the preference signal. Given the absence of evaluators specifically designed for hallucination and omission in MT, we further propose selecting hard instances and utilizing GPT-4 to directly evaluate the performance of the models in mitigating these issues. We verify the rationality of these designed evaluation methods by experiments, followed by extensive results demonstrating the effectiveness of word alignment-based preference optimization to mitigate hallucination and omission.

hallucination and omission, omission, translation, (13 more...)

arXiv.org Artificial Intelligence

2405.09223

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > Canada > Ontario > Toronto (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(10 more...)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)

Add feedback

Simplicity within biological complexity

Przulj, Natasa, Malod-Dognin, Noel

arXiv.org Artificial IntelligenceMay-15-2024

Heterogeneous, interconnected, systems-level, molecular data have become increasingly available and key in precision medicine. We need to utilize them to better stratify patients into risk groups, discover new biomarkers and targets, repurpose known and discover new drugs to personalize medical treatment. Existing methodologies are limited and a paradigm shift is needed to achieve quantitative and qualitative breakthroughs. In this perspective paper, we survey the literature and argue for the development of a comprehensive, general framework for embedding of multi-scale molecular network data that would enable their explainable exploitation in precision medicine in linear time. Network embedding methods map nodes to points in low-dimensional space, so that proximity in the learned space reflects the network's topology-function relationships. They have recently achieved unprecedented performance on hard problems of utilizing few omic data in various biomedical applications. However, research thus far has been limited to special variants of the problems and data, with the performance depending on the underlying topology-function network biology hypotheses, the biomedical applications and evaluation metrics. The availability of multi-omic data, modern graph embedding paradigms and compute power call for a creation and training of efficient, explainable and controllable models, having no potentially dangerous, unexpected behaviour, that make a qualitative breakthrough. We propose to develop a general, comprehensive embedding framework for multi-omic network data, from models to efficient and scalable software implementation, and to apply it to biomedical informatics. It will lead to a paradigm shift in computational and biomedical understanding of data and diseases that will open up ways to solving some of the major bottlenecks in precision medicine and other domains.

international conference, proceedings, representation, (17 more...)

arXiv.org Artificial Intelligence

2405.09595

Country:

North America > United States (0.28)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > China (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Information Technology (1.00)
Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Hematology (1.00)
(3 more...)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
(9 more...)

Add feedback

Using Contextual Information for Sentence-level Morpheme Segmentation

Bhandari, Prabin, Paudel, Abhishek

arXiv.org Artificial IntelligenceMay-14-2024

Recent advancements in morpheme segmentation primarily emphasize word-level segmentation, often neglecting the contextual relevance within the sentence. In this study, we redefine the morpheme segmentation task as a sequence-to-sequence problem, treating the entire sentence as input rather than isolating individual words. Our findings reveal that the multilingual model consistently exhibits superior performance compared to monolingual counterparts. While our model did not surpass the performance of the current state-of-the-art, it demonstrated comparable efficacy with high-resource languages while revealing limitations in low-resource language scenarios.

dataset, morpheme segmentation, segmentation, (11 more...)

arXiv.org Artificial Intelligence

2403.15436

Country: Oceania > Australia > Victoria > Melbourne (0.04)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)

Add feedback

Enhancing Gender-Inclusive Machine Translation with Neomorphemes and Large Language Models

Piergentili, Andrea, Savoldi, Beatrice, Negri, Matteo, Bentivogli, Luisa

arXiv.org Artificial IntelligenceMay-14-2024

Machine translation (MT) models are known to suffer from gender bias, especially when translating into languages with extensive gendered morphology. Accordingly, they still fall short in using gender-inclusive language, also representative of non-binary identities. In this paper, we look at gender-inclusive neomorphemes, neologistic elements that avoid binary gender markings as an approach towards fairer MT. In this direction, we explore prompting techniques with large language models (LLMs) to translate from English into Italian using neomorphemes. So far, this area has been under-explored due to its novelty and the lack of publicly available evaluation resources. We fill this gap by releasing Neo-GATE, a resource designed to evaluate gender-inclusive en-it translation with neomorphemes. With Neo-GATE, we assess four LLMs of different families and sizes and different prompt formats, identifying strengths and weaknesses of each on this novel task for MT.

neomorpheme, paradigm, translation, (13 more...)

arXiv.org Artificial Intelligence

2405.08477

Country:

Europe > Finland > Pirkanmaa > Tampere (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > Canada > Ontario > Toronto (0.04)
(15 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining

Nagata, Masaaki, Morishita, Makoto, Chousa, Katsuki, Yasuda, Norihito

arXiv.org Artificial IntelligenceMay-14-2024

Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used high-quality 1.2M Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing for web mining of parallel data.

proceedings, sentence pair, translation, (15 more...)

arXiv.org Artificial Intelligence

2405.09017

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
Europe > Slovenia (0.05)
(16 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.95)
Information Technology > Communications > Social Media > Crowdsourcing (0.95)
Information Technology > Data Science > Data Mining > Web Mining (0.81)

Add feedback

Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems

Lu, Bo-Han, Lin, Yi-Hsuan, Lee, En-Shiun Annie, Tsai, Richard Tzong-Han

arXiv.org Artificial IntelligenceMay-14-2024

Machine translation focuses mainly on high-resource languages (HRLs), while low-resource languages (LRLs) like Taiwanese Hokkien are relatively under-explored. The study aims to address this gap by developing a dual translation model between Taiwanese Hokkien and both Traditional Mandarin Chinese and English. We employ a pre-trained LLaMA 2-7B model specialized in Traditional Mandarin Chinese to leverage the orthographic similarities between Taiwanese Hokkien Han and Traditional Mandarin Chinese. Our comprehensive experiments involve translation tasks across various writing systems of Taiwanese Hokkien as well as between Taiwanese Hokkien and other HRLs. We find that the use of a limited monolingual corpus still further improves the model's Taiwanese Hokkien capabilities. We then utilize our translation model to standardize all Taiwanese Hokkien writing systems into Hokkien Han, resulting in further performance improvements. Additionally, we introduce an evaluation method incorporating back-translation and GPT-4 to ensure reliable translation quality assessment even for LRLs. The study contributes to narrowing the resource gap for Taiwanese Hokkien and empirically investigates the advantages and limitations of pre-training and fine-tuning based on LLaMA 2. Keywords: low-resource language, large language model, neural machine translation, Taiwanese Hokkien

hokkien, taiwanese hokkien, translation, (15 more...)

arXiv.org Artificial Intelligence

2403.12024

Country:

Asia > Taiwan (0.15)
North America > Canada > Ontario > Toronto (0.14)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(7 more...)

Genre: Research Report > New Finding (0.93)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

An Empirical Study on the Robustness of Massively Multilingual Neural Machine Translation

Supryadi, null, Pan, Leiyu, Xiong, Deyi

arXiv.org Artificial IntelligenceMay-13-2024

Massively multilingual neural machine translation (MMNMT) has been proven to enhance the translation quality of low-resource languages. In this paper, we empirically investigate the translation robustness of Indonesian-Chinese translation in the face of various naturally occurring noise. To assess this, we create a robustness evaluation benchmark dataset for Indonesian-Chinese translation. This dataset is automatically translated into Chinese using four NLLB-200 models of different sizes. We conduct both automatic and human evaluations. Our in-depth analysis reveal the correlations between translation error types and the types of noise present, how these correlations change across different model sizes, and the relationships between automatic evaluation indicators and human evaluation indicators.

computational linguistic, robustness, translation, (13 more...)

arXiv.org Artificial Intelligence

2405.07673

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Tuscany > Florence (0.04)
Asia > Indonesia (0.04)
(9 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (0.68)
Information Technology > Services (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Krey\`ol-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages

Robinson, Nathaniel R., Dabre, Raj, Shurtz, Ammon, Dent, Rasul, Onesi, Onenamiyi, Monroc, Claire Bizon, Grobol, Loïc, Muhammad, Hasan, Garg, Ashi, Etori, Naome A., Tiyyala, Vijay Murari, Samuel, Olanrewaju, Stutzman, Matthew Dean, Odoom, Bismarck Bamfo, Khudanpur, Sanjeev, Richardson, Stephen D., Murray, Kenton

arXiv.org Artificial IntelligenceMay-13-2024

A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, including 14.5M unique Creole sentences with parallel translations -- 11.6M of which we release publicly, and the largest bitexts gathered to date for 41 languages -- the first ever for 21. In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity than ever before, which outperforms a genre-specific Creole MT model on its own benchmark for 26 of 34 translation directions.

creole language, proceedings, translation, (11 more...)

arXiv.org Artificial Intelligence

2405.05376

Country:

North America > Central America (0.34)
Africa > Seychelles (0.28)
North America > Haiti (0.14)
(62 more...)

Genre: Research Report (1.00)

Industry: Government > Regional Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Sign Stitching: A Novel Approach to Sign Language Production

Walsh, Harry, Saunders, Ben, Bowden, Richard

arXiv.org Artificial IntelligenceMay-13-2024

Sign Language Production (SLP) is a challenging task, given the limited resources available and the inherent diversity within sign data. As a result, previous works have suffered from the problem of regression to the mean, leading to under-articulated and incomprehensible signing. In this paper, we propose using dictionary examples and a learnt codebook of facial expressions to create expressive sign language sequences. However, simply concatenating signs and adding the face creates robotic and unnatural sequences. To address this we present a 7-step approach to effectively stitch sequences together. First, by normalizing each sign into a canonical pose, cropping, and stitching we create a continuous sequence. Then, by applying filtering in the frequency domain and resampling each sign, we create cohesive natural sequences that mimic the prosody found in the original data. We leverage a SignGAN model to map the output to a photo-realistic signer and present a complete Text-to-Sign (T2S) SLP pipeline. Our evaluation demonstrates the effectiveness of the approach, showcasing state-of-the-art performance across all datasets. Finally, a user evaluation shows our approach outperforms the baseline model and is capable of producing realistic sign language sequences.

arxiv preprint arxiv, proceedings, sequence, (11 more...)

arXiv.org Artificial Intelligence

2405.07663

Country:

Europe > United Kingdom > England > Surrey > Guildford (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)
Europe > Netherlands (0.04)
(3 more...)

Genre:

Research Report > Promising Solution (0.41)
Overview > Innovation (0.41)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)

Add feedback