AITopics | Machine Translation

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

SilverAlign: MT-Based Silver Data Algorithm For Evaluating Word Alignment

Köksal, Abdullatif, Severini, Silvia, Schütze, Hinrich

arXiv.org Artificial Intelligence, Mar-27-2023

Word alignments are essential for a variety of NLP tasks. Therefore, choosing the best approaches for their creation is crucial. However, the scarce availability of gold evaluation data makes the choice difficult. We propose SilverAlign, a new method to automatically create silver data for the evaluation of word aligners by exploiting machine translation and minimal pairs. We show that performance on our silver data correlates well with gold benchmarks for 9 language pairs, making our approach a valid resource for evaluation of different domains and languages when gold data are not available. This addresses the important scenario of missing gold data alignments for low-resource languages.
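To illustrate how a word aligner might be scored against reference links (silver or gold), here is a minimal sketch. The abstract does not state which metric SilverAlign uses; precision, recall, and F1 over alignment links are assumed here as a common choice, and the function name `alignment_prf` and the toy data are hypothetical.

```python
from typing import Iterable, Set, Tuple

# A link pairs a source token index with a target token index.
Link = Tuple[int, int]


def alignment_prf(predicted: Iterable[Link],
                  reference: Iterable[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted links against reference links.

    Assumption: every reference link counts equally (no sure/possible split).
    """
    pred: Set[Link] = set(predicted)
    ref: Set[Link] = set(reference)
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    overlap = len(pred & ref)
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: three reference links, two of which the aligner recovers.
silver_links = {(0, 0), (1, 2), (2, 1)}
aligner_links = {(0, 0), (1, 2), (2, 3)}
precision, recall, f1 = alignment_prf(aligner_links, silver_links)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Running the same scoring over both a silver set and a gold set for several aligners would give the two score lists whose agreement the abstract describes.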

artificial intelligence, machine learning, natural language, (18 more...)

2210.06207

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(10 more...)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation

Jones, Alex, Caswell, Isaac, Saxena, Ishank, Firat, Orhan

arXiv.org Artificial Intelligence, Mar-27-2023

Neural machine translation (NMT) has progressed rapidly over the past several years, and modern models are able to achieve relatively high quality using only monolingual text data, an approach dubbed Unsupervised Machine Translation (UNMT). We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text. We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements, and can be combined for even greater improvements; (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones, especially with larger models; and (4) we compare the efficacy of multilingual lexicon data versus human-translated parallel data. Neural machine translation (NMT) has emerged as the dominant way of training machine translation models (Bahdanau ...
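As a rough illustration of lexical data augmentation with a bilingual lexicon, the sketch below swaps some source words in a monolingual sentence for their dictionary translations. The abstract does not describe the augmentation families the paper compares, so this word-substitution scheme, the function name `lexical_substitute`, and the toy English-Spanish lexicon are all assumptions made for illustration.

```python
import random
from typing import Dict, List


def lexical_substitute(tokens: List[str],
                       lexicon: Dict[str, List[str]],
                       rate: float,
                       rng: random.Random) -> List[str]:
    """Replace each token found in the lexicon with one of its translations.

    `rate` is the probability of substituting a token that has an entry;
    tokens without an entry are always kept unchanged.
    """
    augmented = []
    for token in tokens:
        candidates = lexicon.get(token.lower())
        if candidates and rng.random() < rate:
            augmented.append(rng.choice(candidates))
        else:
            augmented.append(token)
    return augmented


# Hypothetical toy lexicon; a real one would be far larger and noisier.
toy_lexicon = {"cat": ["gato"], "house": ["casa"]}
sentence = "the cat left the house".split()

rng = random.Random(0)  # fixed seed so the example is repeatable
mixed = lexical_substitute(sentence, toy_lexicon, rate=1.0, rng=rng)
unchanged = lexical_substitute(sentence, toy_lexicon, rate=0.0, rng=rng)
print(" ".join(mixed))
```

The `rate` parameter controls how much of the sentence is replaced. Whether substituted sentences are used as extra training text or paired with the originals is a design choice this sketch leaves open.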

artificial intelligence, natural language, translation, (18 more...)

2303.15265

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Africa > Niger (0.05)
Oceania (0.04)
(18 more...)

Genre: Research Report > New Finding (0.92)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Translate the Beauty in Songs: Jointly Learning to Align Melody and Translate Lyrics

Li, Chengxi, Fan, Kai, Bu, Jiajun, Chen, Boxing, Huang, Zhongqiang, Yu, Zhi

arXiv.org Artificial Intelligence, Mar-27-2023

Song translation requires both translation of lyrics and alignment of music notes so that the resulting verse can be sung to the accompanying melody, which is a challenging problem that has attracted some interest in different aspects of the translation process. In this paper, we propose Lyrics-Melody Translation with Adaptive Grouping (LTAG), a holistic solution to automatic song translation by jointly modeling lyrics translation and lyrics-melody alignment. It is a novel encoder-decoder framework that can simultaneously translate the source lyrics and determine the number of aligned notes at each decoding step through an adaptive note grouping module. To address data scarcity, we commissioned a small amount of training data annotated specifically for this task and used large amounts of augmented data through back-translation. Experiments conducted on an English-Chinese song translation data set show the effectiveness of our model in both automatic and human evaluation.

machine learning, natural language, translation, (17 more...)

2303.15705

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(7 more...)

Genre: Research Report (0.64)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Sem4SAP: Synonymous Expression Mining From Open Knowledge Graph For Language Model Synonym-Aware Pretraining

Gu, Zhouhong, Jiang, Sihang, Huang, Wenhao, Liang, Jiaqing, Feng, Hongwei, Xiao, Yanghua

arXiv.org Artificial Intelligence, Mar-25-2023

A model's ability to understand synonymous expressions is crucial in many kinds of downstream tasks. It helps the model better understand the similarity between contexts and makes it more robust to synonym substitution attacks. However, many Pretrained Language Models (PLMs) lack synonym knowledge due to the limitations of small-scale synsets and of PLM pretraining objectives. In this paper, we propose a framework called Sem4SAP to mine synsets from Open Knowledge Graph (Open-KG) and use the mined synsets for synonym-aware pretraining of language models. We propose to coarsely filter the content in Open-KG and use frequency information to aid the clustering process under low-resource unsupervised conditions. We expand the mined synsets by migrating core semantics between synonymous expressions. We also propose two novel and effective synonym-aware pre-training methods for injecting synonym knowledge into PLMs. Extensive experiments demonstrate that Sem4SAP can dramatically outperform the original PLMs and other baselines on ten different tasks.

artificial intelligence, machine learning, natural language, (19 more...)

2303.14425

Country:

North America > United States (1.00)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.84)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Natural Language Processing in Ethiopian Languages: Current State, Challenges, and Opportunities

Tonja, Atnafu Lambebo, Belay, Tadesse Destaw, Azime, Israel Abebe, Ayele, Abinew Ali, Mehamed, Moges Ahmed, Kolesnikova, Olga, Yimam, Seid Muhie

arXiv.org Artificial Intelligence, Mar-25-2023

This survey delves into the current state of natural language processing (NLP) for four Ethiopian languages: Amharic, Afaan Oromo, Tigrinya, and Wolaytta. Through this paper, we identify key challenges and opportunities for NLP research in Ethiopia. Furthermore, we provide a centralized repository on GitHub that contains publicly available resources for various NLP tasks in these languages. This repository can be updated periodically with contributions from other researchers. Our objective is to identify research gaps and disseminate the information to NLP researchers interested in Ethiopian languages and encourage future research in this domain.

ethiopian language, machine learning, natural language, (20 more...)

2303.14406

Country:

Asia > Middle East > Israel (0.04)
Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)
Africa > Ethiopia > Southern Nations, Nationalities, and Peoples' Region > Hawassa (0.04)
(14 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry: Media > News (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Efficient Methods for Natural Language Processing: A Survey

Treviso, Marcos, Lee, Ji-Ung, Ji, Tianchu, van Aken, Betty, Cao, Qingqing, Ciosici, Manuel R., Hassid, Michael, Heafield, Kenneth, Hooker, Sara, Raffel, Colin, Martins, Pedro H., Martins, André F. T., Forde, Jessica Zosa, Milder, Peter, Simpson, Edwin, Slonim, Noam, Dodge, Jesse, Strubell, Emma, Balasubramanian, Niranjan, Derczynski, Leon, Gurevych, Iryna, Schwartz, Roy

arXiv.org Artificial Intelligence, Mar-24-2023

Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim to provide both guidance for conducting NLP under limited resources, and point towards promising research directions for developing more efficient methods.

computational linguistic, large language model, machine learning, (17 more...)

2209.00099

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
(23 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Energy (1.00)
Education > Educational Setting (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Analyzing the Generalizability of Deep Contextualized Language Representations For Text Classification

Buyukoz, Berfu

arXiv.org Artificial Intelligence, Mar-22-2023

This study evaluates the robustness of two state-of-the-art deep contextual language representations, ELMo and DistilBERT, on supervised learning of binary protest news classification and sentiment analysis of product reviews. A "cross-context" setting is enabled using test sets that are distinct from the training data. Specifically, in the news classification task, the models are developed on local news from India and tested on the local news from China. In the sentiment analysis task, the models are trained on movie reviews and tested on customer reviews. This comparison is aimed at exploring the limits of the representative power of today's Natural Language Processing systems on the path to the systems that are generalizable to real-life scenarios. The models are fine-tuned and fed into a Feed-Forward Neural Network and a Bidirectional Long Short Term Memory network. Multinomial Naive Bayes and Linear Support Vector Machine are used as traditional baselines. The results show that, in binary text classification, DistilBERT is significantly better than ELMo on generalizing to the cross-context setting. ELMo is observed to be significantly more robust to the cross-context test data than both baselines. On the other hand, the baselines performed comparably well to ELMo when the training and test data are subsets of the same corpus (no cross-context). DistilBERT is also found to be 30% smaller and 83% faster than ELMo. The results suggest that DistilBERT can transfer generic semantic knowledge to other domains better than ELMo. DistilBERT is also favorable in incorporating into real-life systems for it requires a smaller computational training budget. When generalization is not the utmost preference and test domain is similar to the training domain, the traditional ML algorithms can still be considered as more economic alternatives to deep language representations.

2303.12936

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Minnesota (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education (1.00)
Media > News (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Towards Understanding the Generalization of Medical Text-to-SQL Models and Datasets

Tarbell, Richard, Choo, Kim-Kwang Raymond, Dietrich, Glenn, Rios, Anthony

arXiv.org Artificial Intelligence, Mar-22-2023

Electronic medical records (EMRs) are stored in relational databases. It can be challenging to access the required information if the user is unfamiliar with the database schema or general database fundamentals. Hence, researchers have explored text-to-SQL generation methods that provide healthcare professionals direct access to EMR data without needing a database expert. However, currently available datasets have been essentially "solved" with state-of-the-art models achieving accuracy greater than or near 90%. In this paper, we show that there is still a long way to go before solving text-to-SQL generation in the medical domain. To show this, we create new splits of the existing medical text-to-SQL dataset MIMICSQL that better measure the generalizability of the resulting models. We evaluate state-of-the-art language models on our new split, showing substantial drops in performance with accuracy dropping from up to 92% to 28%, thus showing substantial room for improvement. Moreover, we introduce a novel data augmentation approach to improve the generalizability of the language models. Overall, this paper is the first step towards developing more robust text-to-SQL models in the medical domain.

Introduction: Electronic medical records (EMRs) are crucial for evaluating and treating patients. For instance, EMRs can be used to predict mortality risk for patients [1-3] and are the basis of knowledge used for billing [4] (e.g., with ICD10 codes).

artificial intelligence, machine learning, natural language, (21 more...)

2303.12898

Country:

North America > United States > Texas (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Europe > France > Occitanie > Hérault > Montpellier (0.04)
Asia > Middle East > Israel (0.04)

Genre:

Research Report > Promising Solution (0.88)
Research Report > New Finding (0.68)

Industry:

Health & Medicine > Health Care Technology > Medical Record (1.00)
Health & Medicine > Health Care Providers & Services (0.68)

Technology:

Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Optical Character Recognition and Transcription of Berber Signs from Images in a Low-Resource Language Amazigh

Corallo, Levi, Varde, Aparna S.

arXiv.org Artificial Intelligence, Mar-21-2023

The Berber, or Amazigh, language family is a low-resource North African vernacular language spoken by the indigenous Berber ethnic group. It has its own unique alphabet called Tifinagh used across Berber communities in Morocco, Algeria, and elsewhere. The Afroasiatic language Berber is spoken by 14 million people, yet lacks adequate representation in education, research, web applications, etc. For instance, there is no option of translation to or from Amazigh / Berber on Google Translate, which hosts over 100 languages today. Consequently, we do not find specialized educational apps, L2 (2nd language learner) acquisition, automated language translation, and remote-access facilities enabled in Berber. Motivated by this background, we propose a supervised approach called DaToBS for Detection and Transcription of Berber Signs. The DaToBS approach entails the automatic recognition and transcription of Tifinagh characters from signs in photographs of natural environments. This is achieved by self-creating a corpus of 1862 pre-processed character images; curating the corpus with human-guided annotation; and feeding it into an OCR model via the deployment of CNN for deep learning based on computer vision models. We deploy computer vision modeling (rather than language models) because there are pictorial symbols in this alphabet, this deployment being a novel aspect of our work. The DaToBS experimentation and analyses yield over 92 percent accuracy in our research. To the best of our knowledge, ours is among the first few works in the automated transcription of Berber signs from roadside images with deep learning, yielding high accuracy. This can pave the way for developing pedagogical applications in the Berber language, thereby addressing an important goal of outreach to underrepresented communities via AI in education.

corpus, machine learning, natural language, (19 more...)

2303.13549

Country:

Africa > Middle East > Algeria (0.25)
Europe > United Kingdom (0.24)
North America > United States > Illinois (0.04)
(3 more...)

Genre: Research Report > New Finding (0.66)

Industry:

Information Technology (0.47)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Transformers in Speech Processing: A Survey

Latif, Siddique, Zaidi, Aun, Cuayahuitl, Heriberto, Shamshad, Fahad, Shoukat, Moazzam, Qadir, Junaid

arXiv.org Artificial Intelligence, Mar-21-2023

The remarkable success of transformers in the field of natural language processing has sparked the interest of the speech-processing community, leading to an exploration of their potential for modeling long-range dependencies within speech sequences. Recently, transformers have gained prominence across various speech-related domains, including automatic speech recognition, speech synthesis, speech translation, speech para-linguistics, speech enhancement, spoken dialogue systems, and numerous multimodal applications. In this paper, we present a comprehensive survey that aims to bridge research studies from diverse subfields within speech technology. By consolidating findings from across the speech technology landscape, we provide a valuable resource for researchers interested in harnessing the power of transformers to advance the field. We identify the challenges encountered by transformers in speech processing while also offering insights into potential solutions to address these issues.

large language model, machine learning, natural language, (17 more...)

2303.11607

Country:

North America > United States (0.14)
Oceania > Australia > Queensland (0.04)
Europe > United Kingdom > England > Lincolnshire > Lincoln (0.04)
(7 more...)

Genre:

Overview (1.00)
Research Report > Promising Solution (0.48)
Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment (0.67)
Education (0.67)
Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)
