AITopics | Machine Translation

Collaborating Authors

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

News Overviews Instructional Materials AI-Alerts Classics

Relay Decoding: Concatenating Large Language Models for Machine Translation

Fu, Chengpeng, Feng, Xiaocheng, Huang, Yichong, Huo, Wenshuai, Li, Baohang, Wang, Hui, Qin, Bin, Liu, Ting

arXiv.org Artificial IntelligenceMay-5-2024

Leveraging large language models for machine translation has demonstrated promising results. However, it does require the large language models to possess the capability of handling both the source and target languages in machine translation. When it is challenging to find large models that support the desired languages, resorting to continuous learning methods becomes a costly endeavor. To mitigate these expenses, we propose an innovative approach called RD (Relay Decoding), which entails concatenating two distinct large models that individually support the source and target languages. By incorporating a simple mapping layer to facilitate the connection between these two models and utilizing a limited amount of parallel data for training, we successfully achieve superior results in the machine translation task. Experimental results conducted on the Multi30k and WikiMatrix datasets validate the effectiveness of our proposed method.

language model, machine translation, translation, (10 more...)

arXiv.org Artificial Intelligence

2405.02933

Country:

Asia > Singapore (0.05)
Asia > China > Heilongjiang Province > Harbin (0.05)
North America > United States > Pennsylvania (0.04)
(6 more...)

Genre: Research Report (1.00)

Industry: Education (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Sentiment Analysis Across Languages: Evaluation Before and After Machine Translation to English

Kathunia, Aekansh, Kaif, Mohammad, Arora, Nalin, Narotam, N

arXiv.org Artificial IntelligenceMay-5-2024

People communicate in more than 7,000 languages around the world, with around 780 languages spoken in India alone. Despite this linguistic diversity, research on Sentiment Analysis has predominantly focused on English text data, resulting in a disproportionate availability of sentiment resources for English. This paper examines the performance of transformer models in Sentiment Analysis tasks across multilingual datasets and text that has undergone machine translation. By comparing the effectiveness of these models in different linguistic contexts, we gain insights into their performance variations and potential implications for sentiment analysis across diverse languages. We also discuss the shortcomings and potential for future work towards the end.

bert 0, machine translation, sentiment analysis, (11 more...)

arXiv.org Artificial Intelligence

2405.02887

Country:

Asia > India (0.24)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan > Honshū > Tōhoku (0.05)
(6 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)

Add feedback

On the Information Redundancy in Non-Autoregressive Translation

Wang, Zhihao, Wang, Longyue, Su, Jinsong, Yao, Junfeng, Tu, Zhaopeng

arXiv.org Artificial IntelligenceMay-4-2024

Token repetition is a typical form of multi-modal problem in fully non-autoregressive translation (NAT). In this work, we revisit the multi-modal problem in recently proposed NAT models. Our study reveals that these advanced models have introduced other types of information redundancy errors, which cannot be measured by the conventional metric - the continuous repetition ratio. By manually annotating the NAT outputs, we identify two types of information redundancy errors that correspond well to lexical and reordering multi-modality problems. Since human annotation is time-consuming and labor-intensive, we propose automatic metrics to evaluate the two types of redundant errors. Our metrics allow future studies to evaluate new methods and gain a more comprehensive understanding of their effectiveness.

nat model, redundancy, translation, (14 more...)

arXiv.org Artificial Intelligence

2405.02673

Country:

Asia > China > Fujian Province > Xiamen (0.04)
North America > United States > Maine (0.04)
Asia > Taiwan (0.04)

Genre: Research Report (0.40)

Industry:

Health & Medicine (0.68)
Government > Regional Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.48)

Add feedback

The Call for Socially Aware Language Technologies

Yang, Diyi, Hovy, Dirk, Jurgens, David, Plank, Barbara

arXiv.org Artificial IntelligenceMay-3-2024

Language technologies have made enormous progress, especially with the introduction of large language models (LLMs). On traditional tasks such as machine translation and sentiment analysis, these models perform at near-human level. These advances can, however, exacerbate a variety of issues that models have traditionally struggled with, such as bias, evaluation, and risks. In this position paper, we argue that many of these issues share a common core: a lack of awareness of the factors, context, and implications of the social environment in which NLP operates, which we call social awareness. While NLP is getting better at solving the formal linguistic aspects, limited progress has been made in adding the social awareness required for language applications to work in all situations for all users. Integrating social awareness into NLP models will make applications more natural, helpful, and safe, and will open up new possibilities. Thus we argue that substantial challenges remain for NLP to develop social awareness and that we are just at the beginning of a new era for the field.

computational linguistic, proceedings, social awareness, (12 more...)

arXiv.org Artificial Intelligence

2405.02411

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Asia > Singapore (0.04)
(14 more...)

Genre: Research Report (0.40)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.88)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Add feedback

UQA: Corpus for Urdu Question Answering

Arif, Samee, Farid, Sualeha, Athar, Awais, Raza, Agha Ali

arXiv.org Artificial IntelligenceMay-2-2024

This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results. For XLM-RoBERTa-XL, we have an F1 score of 85.99 and 74.56 EM. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at www.github.com/sameearif/UQA.

computational linguistic, dataset, paragraph, (15 more...)

arXiv.org Artificial Intelligence

2405.01458

Country:

Asia > Pakistan > Punjab > Lahore Division > Lahore (0.05)
Europe > Italy > Tuscany > Florence (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
(8 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

On the Effect of (Near) Duplicate Subwords in Language Modelling

Schäfer, Anton, Hofmann, Thomas, Schlag, Imanol, Pimentel, Tiago

arXiv.org Artificial IntelligenceMay-2-2024

Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords which are assigned arbitrary indices before being served to the LM. While typically lossless, however, this process may lead to less sample efficient LM training: as it removes character-level information, it could make it harder for LMs to generalise across similar subwords, such as now and Now. We refer to such subwords as near duplicates. In this paper, we study the impact of near duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound to how much we should expect a model to improve if we could perfectly generalise across near duplicates. We do this by duplicating each subword in our LM's vocabulary, creating perfectly equivalent classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in a fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that merging them considerably hurts LM performance. Therefore, although subword duplication negatively impacts LM training efficiency, naturally occurring near duplicates may not be as similar as anticipated, limiting the potential for performance improvements.

duplicate, information, subword, (16 more...)

arXiv.org Artificial Intelligence

2404.06508

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Arizona > Maricopa County > Scottsdale (0.04)
North America > Dominican Republic (0.04)
(5 more...)

Genre: Research Report > Experimental Study (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

The IgboAPI Dataset: Empowering Igbo Language Technologies through Multi-dialectal Enrichment

Emezue, Chris Chinenye, Okoh, Ifeoma, Mbonu, Chinedu, Chukwuneke, Chiamaka, Lal, Daisy, Ezeani, Ignatius, Rayson, Paul, Onwuzulike, Ijemma, Okeke, Chukwuma, Nweya, Gerald, Ogbonna, Bright, Oraegbunam, Chukwuebuka, Awo-Ndubuisi, Esther Chidinma, Osuagwu, Akudo Amarachukwu, Nmezi, Obioha

arXiv.org Artificial IntelligenceMay-2-2024

The Igbo language is facing a risk of becoming endangered, as indicated by a 2025 UNESCO study. This highlights the need to develop language technologies for Igbo to foster communication, learning and preservation. To create robust, impactful, and widely adopted language technologies for Igbo, it is essential to incorporate the multi-dialectal nature of the language. The primary obstacle in achieving dialectal-aware language technologies is the lack of comprehensive dialectal datasets. In response, we present the IgboAPI dataset, a multi-dialectal Igbo-English dictionary dataset, developed with the aim of enhancing the representation of Igbo dialects. Furthermore, we illustrate the practicality of the IgboAPI dataset through two distinct studies: one focusing on Igbo semantic lexicon and the other on machine translation. In the semantic lexicon project, we successfully establish an initial Igbo semantic lexicon for the Igbo semantic tagger, while in the machine translation study, we demonstrate that by finetuning existing machine translation systems using the IgboAPI dataset, we significantly improve their ability to handle dialectal variations in sentences.

dataset, dialect, translation, (15 more...)

arXiv.org Artificial Intelligence

2405.00997

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Asia > Indonesia > Bali (0.04)
Africa > Nigeria > Oyo State > Ibadan (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)
Health & Medicine > Therapeutic Area > Immunology (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Efficient Sample-Specific Encoder Perturbations

Fathullah, Yassir, Gales, Mark J. F.

arXiv.org Artificial IntelligenceMay-1-2024

Encoder-decoder foundation models have displayed state-of-the-art performance on a range of autoregressive sequence tasks. This paper proposes a simple and lightweight modification to such systems to control the behaviour according to a specific attribute of interest. This paper proposes a novel inference-efficient approach to modifying the behaviour of an encoder-decoder system according to a specific attribute of interest. Specifically, we show that a small proxy network can be used to find a sample-by-sample perturbation of the encoder output of a frozen foundation model to trigger the decoder to generate improved decodings. This work explores a specific realization of this framework focused on improving the COMET performance of Flan-T5 on Machine Translation and the WER of Whisper foundation models on Speech Recognition. Results display consistent improvements in performance evaluated through COMET and WER respectively. Furthermore, experiments also show that the proxies are robust to the exact nature of the data used to train them and can extend to other domains.

computational linguistic, language model, proceedings, (11 more...)

arXiv.org Artificial Intelligence

2405.01601

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > New York > Monroe County > Rochester (0.04)
(5 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Context-Aware Machine Translation with Source Coreference Explanation

Vu, Huy Hien, Kamigaito, Hidetaka, Watanabe, Taro

arXiv.org Artificial IntelligenceApr-30-2024

Despite significant improvements in enhancing the quality of translation, context-aware machine translation (MT) models underperform in many cases. One of the main reasons is that they fail to utilize the correct features from context when the context is too long or their models are overly complex. This can lead to the explain-away effect, wherein the models only consider features easier to explain predictions, resulting in inaccurate translations. To address this issue, we propose a model that explains the decisions made for translation by predicting coreference features in the input. We construct a model for input coreference by exploiting contextual features from both the input and translation output representations on top of an existing MT model. We evaluate and analyze our method in the WMT document-level translation task of English-German dataset, the English-Russian dataset, and the multilingual TED talk dataset, demonstrating an improvement of over 1.0 BLEU score when compared with other context-aware models.

machine learning, natural language, translation, (17 more...)

arXiv.org Artificial Intelligence

2404.19505

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
Europe > Denmark > Capital Region > Copenhagen (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(24 more...)

Genre: Research Report > New Finding (0.93)

Industry: Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Which Nigerian-Pidgin does Generative AI speak?: Issues about Representativeness and Bias for Multilingual and Low Resource Languages

Adelani, David Ifeoluwa, Doğruöz, A. Seza, Shode, Iyanuoluwa, Aremu, Anuoluwapo

arXiv.org Artificial IntelligenceApr-30-2024

Naija is the Nigerian-Pidgin spoken by approx. 120M speakers in Nigeria and it is a mixed language (e.g., English, Portuguese and Indigenous languages). Although it has mainly been a spoken language until recently, there are currently two written genres (BBC and Wikipedia) in Naija. Through statistical analyses and Machine Translation experiments, we prove that these two genres do not represent each other (i.e., there are linguistic differences in word order and vocabulary) and Generative AI operates only based on Naija written in the BBC genre. In other words, Naija written in Wikipedia genre is not represented in Generative AI.

bbc genre, naija, wikipedia genre, (10 more...)

arXiv.org Artificial Intelligence

2404.19442

Country:

Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Africa > West Africa (0.04)
Africa > Nigeria > Plateau State > Jos (0.04)
(12 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.81)

Add feedback