AITopics | Machine Translation

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice with free translators such as CAPITA, Google Translate, SDL International, and SYSTRAN.

To Diverge or Not to Diverge: A Morphosyntactic Perspective on Machine Translation vs Human Translation

Luo, Jiaming, Cherry, Colin, Foster, George

arXiv.org Artificial Intelligence, Jan-2-2024

We conduct a large-scale fine-grained comparative analysis of machine translations (MT) against human translations (HT) through the lens of morphosyntactic divergence. Across three language pairs and two types of divergence defined as the structural difference between the source and the target, MT is consistently more conservative than HT, with less morphosyntactic diversity, more convergent patterns, and more one-to-one alignments. Through analysis on different decoding algorithms, we attribute this discrepancy to the use of beam search that biases MT towards more convergent patterns. This bias is most amplified when the convergent pattern appears around 50% of the time in training data. Lastly, we show that for a majority of morphosyntactic divergences, their presence in HT is correlated with decreased MT performance, presenting a greater challenge for MT systems.

computational linguistic, divergence, translation

2401.01419

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report > New Finding (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Machine Translation Testing via Syntactic Tree Pruning

Zhang, Quanjun, Zhai, Juan, Fang, Chunrong, Liu, Jiawei, Sun, Weisong, Hu, Haichuan, Wang, Qingyu

arXiv.org Artificial Intelligence, Jan-1-2024

Machine translation systems have been widely adopted in our daily life, making life easier and more convenient. Unfortunately, erroneous translations may result in severe consequences, such as financial losses. This calls for improving the accuracy and reliability of machine translation systems. However, it is challenging to test machine translation systems because of the complexity and intractability of the underlying neural models. To tackle these challenges, we propose a novel metamorphic testing approach by syntactic tree pruning (STP) to validate machine translation systems. Our key insight is that a pruned sentence should have similar crucial semantics compared with the original sentence. Specifically, STP (1) proposes a core semantics-preserving pruning strategy by basic sentence structure and dependency relations on the level of syntactic tree representation; (2) generates source sentence pairs based on the metamorphic relation; (3) reports suspicious issues whose translations break the consistency property by a bag-of-words model. We further evaluate STP on two state-of-the-art machine translation systems (i.e., Google Translate and Bing Microsoft Translator) with 1,200 source sentences as inputs. The results show that STP can accurately find 5,073 unique erroneous translations in Google Translate and 5,100 unique erroneous translations in Bing Microsoft Translator (400% more than state-of-the-art techniques), with 64.5% and 65.4% precision, respectively. The reported erroneous translations vary in type, and more than 90% of them cannot be found by state-of-the-art techniques. There are 9,393 erroneous translations unique to STP, which is 711.9% more than state-of-the-art techniques. Moreover, STP is quite effective at detecting translation errors for the original sentences, with a recall reaching 74.0%, improving on state-of-the-art techniques by 55.1% on average.
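
The consistency check in step (3) of the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the whitespace tokenizer, the function names (bag_of_words, missing_fraction, is_suspicious) and the 0.3 threshold are all assumptions made for illustration. The idea shown is only that a translation of a pruned sentence should introduce few words that are absent from the translation of the original sentence.

```python
from collections import Counter


def bag_of_words(text):
    """Return a lowercased token multiset (whitespace split is a simplification)."""
    return Counter(text.lower().split())


def missing_fraction(pruned_translation, original_translation):
    """Fraction of tokens in the pruned translation not covered by the original one."""
    pruned = bag_of_words(pruned_translation)
    original = bag_of_words(original_translation)
    total = sum(pruned.values())
    if total == 0:
        return 0.0
    # Counter subtraction keeps only tokens the pruned side has in excess.
    missing = sum((pruned - original).values())
    return missing / total


def is_suspicious(original_translation, pruned_translation, threshold=0.3):
    """Flag the pair when too many pruned-side tokens are new (threshold is hypothetical)."""
    return missing_fraction(pruned_translation, original_translation) > threshold


# Consistent pair: the pruned translation only drops the word "old".
print(is_suspicious("the cat sat on the old mat", "the cat sat on the mat"))
# Inconsistent pair: the pruned translation introduces mostly new words.
print(is_suspicious("the cat sat on the old mat", "the dog ran away"))
```

A real system would obtain both translations from the translator under test and would likely use a proper tokenizer for the target language.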

erroneous translation, source sentence, translation

2401.00751

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
Asia > China > Jiangsu Province > Nanjing (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report > Promising Solution (1.00)
Research Report > New Finding (1.00)

Industry:

Information Technology (1.00)
Education (1.00)
Media (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Self-supervised learning for skin cancer diagnosis with limited training data

Haggerty, Hamish, Chandra, Rohitash

arXiv.org Artificial Intelligence, Jan-1-2024

Cancer diagnosis is a well-studied problem in machine learning since early detection of cancer is often the determining factor in prognosis. Supervised deep learning achieves excellent results in cancer image classification, usually through transfer learning. However, these models require large amounts of labelled data and for several types of cancer, large labelled datasets do not exist. In this paper, we demonstrate that a model pre-trained using a self-supervised learning algorithm known as Barlow Twins can outperform the conventional supervised transfer learning pipeline. We juxtapose two base models: i) pretrained in a supervised fashion on ImageNet; ii) pretrained in a self-supervised fashion on ImageNet. Both are subsequently fine-tuned on a small labelled skin lesion dataset and evaluated on a large test set. We achieve a mean test accuracy of 70% for self-supervised transfer in comparison to 66% for supervised transfer. Interestingly, boosting performance further is possible by self-supervised pretraining a second time (on unlabelled skin lesion images) before subsequent fine-tuning. This hints at an alternative path to collecting more labelled data in settings where this is challenging, namely just collecting more unlabelled images. Our framework is applicable to cancer image classification models in the low-labelled data regime.

classification, dataset, imagenet

2401.00692

Country:

North America > United States (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)
Europe > Poland > Masovia Province > Warsaw (0.04)

Genre: Research Report > New Finding (0.68)

Industry:

Health & Medicine > Therapeutic Area > Dermatology (1.00)
Health & Medicine > Therapeutic Area > Oncology > Skin Cancer (0.51)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Normalization of Lithuanian Text Using Regular Expressions

Kasparaitis, Pijus

arXiv.org Artificial Intelligence, Jan-1-2024

Text Normalization is an integral part of any text-to-speech synthesis system. In a natural language text, there are elements such as numbers, dates, abbreviations, etc. that belong to other semiotic classes. They are called non-standard words (NSW) and need to be expanded into ordinary words. For this purpose, it is necessary to identify the semiotic class of each NSW. The taxonomy of semiotic classes adapted to the Lithuanian language is presented in the work. Sets of rules are created for detecting and expanding NSWs based on regular expressions. Experiments with three completely different data sets were performed and the accuracy was assessed. Causes of errors are explained and recommendations are given for the development of text normalization rules.
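
As a rough illustration of the rule-based approach described in the abstract above, the sketch below pairs each semiotic class with a regular expression and an expansion function. It is not taken from the paper: the rules are English stand-ins rather than Lithuanian ones, the names (RULES, ABBREVIATIONS, normalize) are hypothetical, and numbers are read digit by digit purely to keep the example short.

```python
import re

# English stand-ins; real rules would be language-specific.
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
ABBREVIATIONS = {"etc.": "et cetera", "Dr.": "doctor"}


def expand_digits(match):
    """Read a number digit by digit (a simplification of real number expansion)."""
    return " ".join(ONES[int(digit)] for digit in match.group(0))


def expand_abbreviation(match):
    """Look the abbreviation up in a small hypothetical table."""
    return ABBREVIATIONS[match.group(0)]


# Each rule: (semiotic class, detection pattern, expansion function).
RULES = [
    ("abbreviation", re.compile(r"\b(?:etc|Dr)\."), expand_abbreviation),
    ("number", re.compile(r"\d+"), expand_digits),
]


def normalize(text):
    """Expand non-standard words and report which class each one was assigned."""
    found = []
    for semiotic_class, pattern, expander in RULES:
        def replace(match, cls=semiotic_class, exp=expander):
            found.append((cls, match.group(0)))
            return exp(match)
        text = pattern.sub(replace, text)
    return text, found


normalized, classes = normalize("Dr. Smith has 42 cats etc.")
print(normalized)
print(classes)
```

Rule order matters in a design like this: abbreviations are expanded before numbers so that an expansion containing digits would still be normalized by the later rule.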

abbreviation, normalization, preposition

2312.1766

Country:

Europe > Lithuania > Vilnius County > Vilnius (0.04)
Oceania > Australia (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (0.50)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.55)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.48)

Conceptualizing Suicidal Behavior: Utilizing Explanations of Predicted Outcomes to Analyze Longitudinal Social Media Data

Nguyen, Van Minh, Nur, Nasheen, Stern, William, Mercer, Thomas, Sen, Chiradeep, Bhattacharyya, Siddhartha, Tumbiolo, Victor, Goh, Seng Jhing

arXiv.org Artificial Intelligence, Dec-30-2023

The COVID-19 pandemic has escalated mental health crises worldwide, with social isolation and economic instability contributing to a rise in suicidal behavior. Suicide can result from social factors such as shame, abuse, abandonment, and mental health conditions like depression, Post-Traumatic Stress Disorder (PTSD), Attention-Deficit/Hyperactivity Disorder (ADHD), anxiety disorders, and bipolar disorders. As these conditions develop, signs of suicidal ideation may manifest in social media interactions. Analyzing social media data using artificial intelligence (AI) techniques can help identify patterns of suicidal behavior, providing invaluable insights for suicide prevention agencies, professionals, and broader community awareness initiatives. Machine learning algorithms for this purpose require large volumes of accurately labeled data. Previous research has not fully explored the potential of incorporating explanations in analyzing and labeling longitudinal social media data. In this study, we employed a model explanation method, Layer Integrated Gradients, on top of a fine-tuned state-of-the-art language model, to assign each token from Reddit users' posts an attribution score for predicting suicidal ideation. By extracting and analyzing attributions of tokens from the data, we propose a methodology for preliminary screening of social media posts for suicidal ideation without using large language models during inference.

attribution, social media data, tf-idf

doi: 10.1109/ICMLA58977.2023.00316

2312.08299

Country:

North America > United States > Florida > Brevard County > Melbourne (0.05)
North America > United States > Maryland (0.04)
Asia > Middle East > UAE (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology > Attention Deficit/Hyperactivity Disorder (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.89)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Translating Hanja Historical Documents to Contemporary Korean and English

Son, Juhee, Jin, Jiho, Yoo, Haneul, Bak, JinYeong, Cho, Kyunghyun, Oh, Alice

arXiv.org Artificial Intelligence, Dec-29-2023

The Annals of Joseon Dynasty (AJD) contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea. The Annals were originally written in an archaic Korean writing system, 'Hanja', and were translated into Korean from 1968 to 1993. The resulting translation was, however, too literal and contained many archaic Korean words; thus, a new expert translation effort began in 2012. Since then, the records of only one king have been completed in a decade. In parallel, expert translators are working on an English translation, also at a slow pace, and have produced only one king's records in English so far. Thus, we propose H2KE, a neural machine translation model that translates historical documents in Hanja to more easily understandable Korean and to English. Built on top of multilingual neural machine translation, H2KE learns to translate a historical document written in Hanja from both a full dataset of outdated Korean translation and a small dataset of more recently translated contemporary Korean and English. We compare our method against two baselines: a recent model that simultaneously learns to restore and translate Hanja historical documents and a Transformer-based model trained only on newly translated corpora. The experiments reveal that our method significantly outperforms the baselines in terms of BLEU scores for both contemporary Korean and English translations. We further conduct extensive human evaluation which shows that our translation is preferred over the original expert translations by both experts and non-expert Korean speakers.

evaluation, hanja, translation

2205.10019

Country:

North America > United States > New York (0.04)
Europe > Spain (0.04)
Asia > South Korea (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

PhayaThaiBERT: Enhancing a Pretrained Thai Language Model with Unassimilated Loanwords

Sriwirote, Panyut, Thapiang, Jalinee, Timtong, Vasan, Rutherford, Attapol T.

arXiv.org Artificial Intelligence, Dec-28-2023

While WangchanBERTa has become the de facto standard in transformer-based Thai language modeling, it still has shortcomings in regard to the understanding of foreign words, most notably English words, which are often borrowed without orthographic assimilation into Thai in many contexts. We identify the lack of foreign vocabulary in WangchanBERTa's tokenizer as the main source of these shortcomings. We then expand WangchanBERTa's vocabulary via vocabulary transfer from XLM-R's pretrained tokenizer and pretrain a new model using the expanded tokenizer, starting from WangchanBERTa's checkpoint, on a new dataset that is larger than the one used to train WangchanBERTa. Our results show that our new pretrained model, PhayaThaiBERT, outperforms WangchanBERTa in many downstream tasks and datasets.

dataset, huggingface, wangchanberta

2311.12475

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

AI-driven platform for systematic nomenclature and intelligent knowledge acquisition of natural medicinal materials

Yang, Zijie, Yin, Yongjing, Kong, Chaojun, Chi, Tiange, Tao, Wufan, Zhang, Yue, Xu, Tian

arXiv.org Artificial Intelligence, Dec-27-2023

Natural Medicinal Materials (NMMs) have a long history of global clinical applications, accompanied by extensive informational records. Despite their significant impact on healthcare, the field faces a major challenge: the non-standardization of NMM knowledge, stemming from historical complexities and causing limitations in broader applications. To address this, we introduce a Systematic Nomenclature for NMMs, underpinned by ShennongAlpha, an AI-driven platform designed for intelligent knowledge acquisition. This nomenclature system enables precise identification and differentiation of NMMs. ShennongAlpha, cataloging over ten thousand NMMs with standardized bilingual information, enhances knowledge management and application capabilities, thereby overcoming traditional barriers. Furthermore, it pioneers AI-empowered conversational knowledge acquisition and standardized machine translation. These synergistic innovations mark the first major advance in integrating domain-specific NMM knowledge with AI, propelling research and applications across both NMM and AI fields while establishing a groundbreaking precedent in this crucial area.

ai-driven platform, nmmsn-zh, nomenclature and intelligent knowledge acquisition

2401.0002

Country:

Asia > India (0.14)
Asia > China > Zhejiang Province > Hangzhou (0.04)
Asia > Japan (0.04)

Genre: Research Report (0.63)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)

Technology:

Information Technology > Knowledge Management > Knowledge Engineering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Rethinking Relation Classification with Graph Meaning Representations

Zhou, Li, Chen, Wenyu, Zeng, Dingyi, Zhang, Malu, Hershcovich, Daniel

arXiv.org Artificial Intelligence, Dec-27-2023

In the field of natural language understanding, the intersection of neural models and graph meaning representations (GMRs) remains a compelling area of research. Despite the growing interest, a critical gap persists in understanding the exact influence of GMRs, particularly concerning relation extraction tasks. Addressing this, we introduce DAGNN-plus, a simple and parameter-efficient neural architecture designed to decouple contextual representation learning from structural information propagation. Coupled with various sequence encoders and GMRs, this architecture provides a foundation for systematic experimentation on two English and two Chinese datasets. Our empirical analysis utilizes four different graph formalisms and nine parsers. The results yield a nuanced understanding of GMRs, showing improvements in three out of the four datasets, particularly favoring English over Chinese due to highly accurate parsers. Interestingly, GMRs appear less effective in literary-domain datasets compared to general-domain datasets. These findings lay the groundwork for better-informed design of GMRs and parsers to improve relation classification, which is expected to tangibly impact the future trajectory of natural language understanding research.

computational linguistic, linguistic, proceedings

2310.09772

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Dominican Republic (0.04)
Europe > Italy > Tuscany > Florence (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

This mind-reading tech using AI can convert brain activity into text

FOX News, Dec-26-2023, 15:00:00 GMT

Kurt Knutsson discusses new technology developed by researchers who have created a portable, non-invasive system that can decode silent thoughts and turn them into text. Imagine if you could communicate with anyone without saying a word, just by thinking. That's the promise of a new technology developed by researchers from the University of Technology Sydney (UTS), who have created a portable, non-invasive system that can decode silent thoughts and turn them into text. The technology, called DeWave, uses an electroencephalogram (EEG) cap to record electrical brain activity through the scalp. It then uses an artificial intelligence (AI) model to segment the EEG wave into distinct units that capture specific characteristics and patterns from the human brain.

cyberguy, technology sydney, university

Industry: Health & Medicine > Therapeutic Area > Neurology (0.89)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.31)
