AITopics | Machine Translation

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice with free translators such as CAPITA, Google Translate, SDL International, and SYSTRAN.

Memory-based Cross-modal Semantic Alignment Network for Radiology Report Generation

Tao, Yitian, Ma, Liyan, Yu, Jing, Zhang, Han

arXiv.org Artificial Intelligence, Mar-31-2024

Generating radiology reports automatically reduces the workload of radiologists and helps the diagnosis of specific diseases. Many existing methods treat this task as a modality transfer process. However, since the key information related to disease accounts for a small proportion of both image and report, it is hard for the model to learn the latent relation between the radiology image and its report, so it fails to generate fluent and accurate radiology reports. To tackle this problem, we propose a memory-based cross-modal semantic alignment model (MCSAM) following an encoder-decoder paradigm. MCSAM includes a well-initialized long-term clinical memory bank to learn disease-related representations as well as prior knowledge for different modalities to retrieve, and uses the retrieved memory to perform feature consolidation. To ensure the semantic consistency of the retrieved cross-modal prior knowledge, a cross-modal semantic alignment module (SAM) is proposed. SAM is also able to generate semantic visual feature embeddings, which can be added to the decoder and benefit report generation. More importantly, to memorize the state and additional information while generating reports with the decoder, we use learnable memory tokens, which can be seen as prompts. Extensive experiments demonstrate the promising performance of our proposed method, which achieves state-of-the-art performance on the MIMIC-CXR dataset.

memory bank, report generation, representation, (13 more...)

2404.00588

Country:

Asia > China > Shanghai > Shanghai (0.05)
Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Nuclear Medicine (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Sensing and Signal Processing > Image Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition

Alshahrani, Saied, Haroon, Hesham, Elfilali, Ali, Njie, Mariama, Matthews, Jeanna

arXiv.org Artificial Intelligence, Mar-31-2024

Wikipedia articles (content pages) are commonly used corpora in Natural Language Processing (NLP) research, especially in low-resource languages other than English. Yet, a few research studies have examined the three Arabic Wikipedia editions, Arabic Wikipedia (AR), Egyptian Arabic Wikipedia (ARZ), and Moroccan Arabic Wikipedia (ARY), and documented issues in the Egyptian Arabic Wikipedia edition regarding the massive automatic creation of its articles using template-based translation from English to Arabic without human involvement. This has overwhelmed the Egyptian Arabic Wikipedia with articles that not only have low-quality content but also do not represent the Egyptian people, their culture, and their dialect. In this paper, we aim to mitigate the problem of template translation that occurred in the Egyptian Arabic Wikipedia by identifying these template-translated articles and their characteristics through exploratory analysis and building automatic detection systems. We first explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions and utilize the resulting insights to build multivariate machine learning classifiers leveraging articles' metadata to detect the template-translated articles automatically. We then publicly deploy and host the best-performing classifier, XGBoost, as an online application called EGYPTIAN WIKIPEDIA SCANNER and release the extracted, filtered, and labeled datasets so that the research community can benefit from our datasets and the online, web-based detection system.

metadata, translation, wikipedia, (15 more...)

2404.00565

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Minnesota (0.04)
North America > Canada > Ontario > Toronto (0.04)
(11 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

An Analysis of BPE Vocabulary Trimming in Neural Machine Translation

Cognetta, Marco, Hiraoka, Tatsuya, Okazaki, Naoaki, Sennrich, Rico, Pinter, Yuval

arXiv.org Artificial Intelligence, Mar-30-2024

We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not been subjected to rigorous scientific scrutiny. While the removal of rare subwords is suggested as best practice in machine translation implementations, both as a means to reduce model size and for improving model performance through robustness, our experiments indicate that, across a large space of hyperparameter settings, vocabulary trimming fails to improve performance, and is even prone to incurring heavy degradation.
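
As a rough illustration of what "replacing rare subwords with their component subwords" can mean, here is a minimal Python sketch. It is not the authors' code or any library's API: the function names (`trim_vocabulary`, `decompose`, `retokenize`), the toy merge table, and the frequencies are all hypothetical, and the sketch assumes each merged subword is split back into the pair that formed it until every piece is either kept or atomic.

```python
from collections import Counter


def trim_vocabulary(token_counts, merges, threshold):
    """Return the subwords kept after threshold trimming.

    token_counts: Counter of subword frequencies in the tokenized corpus.
    merges: dict mapping a merged subword to the (left, right) pair that formed it.
    Atomic symbols (those absent from `merges`) are always kept.
    """
    return {
        tok
        for tok, count in token_counts.items()
        if tok not in merges or count >= threshold
    }


def decompose(token, merges, kept):
    """Recursively split a trimmed subword into kept or atomic pieces."""
    if token in kept or token not in merges:
        return [token]
    left, right = merges[token]
    return decompose(left, merges, kept) + decompose(right, merges, kept)


def retokenize(tokens, merges, kept):
    """Apply trimming to an already BPE-tokenized sequence."""
    out = []
    for tok in tokens:
        out.extend(decompose(tok, merges, kept))
    return out


# Toy data, made up for illustration only.
merges = {
    "lo": ("l", "o"),
    "low": ("lo", "w"),
    "er": ("e", "r"),
    "lower": ("low", "er"),
}
counts = Counter({"low": 5, "lower": 1, "er": 3})

kept = trim_vocabulary(counts, merges, threshold=2)
print(retokenize(["lower", "low"], merges, kept))  # ['low', 'er', 'low']
```

In this toy case "lower" falls below the threshold of 2 and is rewritten as "low" + "er". The sketch counts frequencies only once; a real implementation may handle counts and special tokens differently.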

baseline, computational linguistic, subword, (16 more...)

2404.00397

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Kentucky (0.04)
North America > Canada > Quebec > Montreal (0.04)
(17 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

Mickus, Timothee, Zosa, Elaine, Vázquez, Raúl, Vahtola, Teemu, Tiedemann, Jörg, Segonne, Vincent, Raganato, Alessandro, Apidianaki, Marianna

arXiv.org Artificial Intelligence, Mar-29-2024

This paper presents the results of SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent, yet inaccurate. Such cases of overgeneration put in jeopardy many NLG applications, where correctness is often mission-critical. The shared task was conducted with a newly constructed dataset of 4000 model outputs labeled by 5 annotators each, spanning 3 NLP tasks: machine translation, paraphrase generation and definition modeling. The shared task was tackled by a total of 58 different users grouped in 42 teams, out of which 27 elected to write a system description paper; collectively, they submitted over 300 prediction sets on both tracks of the shared task. We observe a number of key trends in how this task was tackled -- many participants rely on a handful of models, and often rely either on synthetic data for fine-tuning or zero-shot prompting strategies. While a majority of the teams did outperform our proposed baseline system, the performances of top-scoring systems are still consistent with a random handling of the more challenging items.

computational linguistic, dataset, hallucination, (13 more...)

2403.07726

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
Asia > Singapore (0.04)
North America > United States > Pennsylvania (0.04)
(11 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.88)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation

Meyer, Francois, Buys, Jan

arXiv.org Artificial Intelligence, Mar-29-2024

Multilingual modelling can improve machine translation for low-resource languages, partly through shared subword representations. This paper studies the role of subword segmentation in cross-lingual transfer. We systematically compare the efficacy of several subword methods in promoting synergy and preventing interference across different linguistic typologies. Our findings show that subword regularisation boosts synergy in multilingual modelling, whereas BPE more effectively facilitates transfer during cross-lingual fine-tuning. Notably, our results suggest that differences in orthographic word boundary conventions (the morphological granularity of written words) may impede cross-lingual transfer more significantly than linguistic unrelatedness. Our study confirms that decisions around subword modelling can be key to optimising the benefits of multilingual modelling.

computational linguistic, cross-lingual transfer, multilingual, (14 more...)

2403.20157

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Africa > South Africa > Western Cape > Cape Town (0.04)
(14 more...)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Improving Vietnamese-English Medical Machine Translation

Vo, Nhu, Nguyen, Dat Quoc, Le, Dung D., Piccardi, Massimo, Buntine, Wray

arXiv.org Artificial Intelligence, Mar-28-2024

Machine translation for Vietnamese-English in the medical domain is still an under-explored research area. In this paper, we introduce MedEV -- a high-quality Vietnamese-English parallel dataset constructed specifically for the medical domain, comprising approximately 360K sentence pairs. We conduct extensive experiments comparing Google Translate, ChatGPT (gpt-3.5-turbo), state-of-the-art Vietnamese-English neural machine translation models and pre-trained bilingual/multilingual sequence-to-sequence models on our new MedEV dataset. Experimental results show that the best performance is achieved by fine-tuning "vinai-translate" for each translation direction. We publicly release our dataset to promote further research.

dataset, sentence pair, translation, (15 more...)

2403.19161

Country:

Asia > Vietnam (0.05)
Oceania > Australia > New South Wales > Sydney (0.04)

Genre: Research Report (0.84)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.61)

EthioMT: Parallel Corpus for Low-resource Ethiopian Languages

Tonja, Atnafu Lambebo, Kolesnikova, Olga, Gelbukh, Alexander, Kalita, Jugal

arXiv.org Artificial Intelligence, Mar-28-2024

Recent research in natural language processing (NLP) has achieved impressive performance in tasks such as machine translation (MT), news classification, and question-answering in high-resource languages. However, the performance of MT leaves much to be desired for low-resource languages. This is due to the smaller size of available parallel corpora in these languages, if such corpora are available at all. NLP in Ethiopian languages suffers from the same issues due to the unavailability of publicly accessible datasets for NLP tasks, including MT. To help the research community and foster research for Ethiopian languages, we introduce EthioMT -- a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.

ethiopian language, machine translation, translation, (14 more...)

2403.19365

Country:

Africa > Kenya (0.05)
Africa > Eritrea (0.05)
Africa > Middle East > Somalia (0.05)
(22 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

A Tulu Resource for Machine Translation

Narayanan, Manu, Aepli, Noëmi

arXiv.org Artificial Intelligence, Mar-28-2024

We present the first parallel dataset for English-Tulu translation. Tulu, classified within the South Dravidian linguistic family branch, is predominantly spoken by approximately 2.5 million individuals in southwestern India. Our dataset is constructed by integrating human translations into the multilingual machine translation resource FLORES-200. Furthermore, we use this dataset for evaluation purposes in developing our English-Tulu machine translation model. For the model's training, we leverage resources available for related South Dravidian languages. We adopt a transfer learning approach that exploits similarities between high-resource and low-resource languages. This method enables the training of a machine translation system even in the absence of parallel data between the source and target language, thereby overcoming a significant obstacle in machine translation development for low-resource languages. Our English-Tulu system, trained without using parallel English-Tulu data, outperforms Google Translate by 19 BLEU points (in September 2023).

computational linguistic, translation, tulu, (13 more...)

2403.19142

Country:

Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
Asia > India > Karnataka (0.05)
Europe > Switzerland > Zürich > Zürich (0.04)
(15 more...)

Genre:

Research Report (0.82)
Workflow (0.68)

Industry: Education (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Google's Circle to Search feature will soon handle language translation

Engadget, Mar-27-2024, 17:48:02 GMT

Google just announced that it's expanding its recently-launched Circle to Search tool to include language translation, as part of an update to various core services. Circle to Search, as the name suggests, already lets some Android users research stuff by drawing a circle around an object. The forthcoming language translation component won't even require a drawn circle. Google says people will just have to long press the home button or the navigation bar and look for the translate icon. The company showed the tech quickly translating an entire menu with one long press.

google, handle language translation, language translation, (2 more...)

Country: North America > Canada (0.07)

Industry: Information Technology > Services (0.42)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Communications > Mobile (0.81)

The Comparison of Translationese in Machine Translation and Human Transation in terms of Translation Relations

Zhou, Fan

arXiv.org Artificial Intelligence, Mar-27-2024

This study explores the distinctions between neural machine translation (NMT) and human translation (HT) through the lens of translation relations. It benchmarks HT to assess the translation techniques produced by an NMT system and aims to address three key research questions: the differences in overall translation relations between NMT and HT, how each utilizes non-literal translation techniques, and the variations in factors influencing their use of specific non-literal techniques. The research employs two parallel corpora, each spanning nine genres and sharing the same source texts, one translated by NMT and the other by humans. Translation relations in these corpora are manually annotated on aligned pairs, enabling a comparative analysis that draws on linguistic insights, including semantic and syntactic nuances such as hypernyms and alterations in part-of-speech tagging. The results indicate that NMT relies on literal translation significantly more than HT across genres. While NMT performs comparably to HT in employing syntactic non-literal translation techniques, it falls behind in semantic-level performance.

translation, translation relation, translation technique, (15 more...)

2404.08661

Country:

Europe > Czechia > Prague (0.04)
Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
North America > United States > Virginia > Fairfax County > Fairfax (0.04)
(13 more...)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.48)

Industry:

Health & Medicine (0.92)
Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
