AITopics | Machine Translation

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice with free translators such as CAPITA, Google Translate, SDL International, and SYSTRAN.

A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation

Nguyen, Linh The, Tran, Nguyen Luong, Doan, Long, Luong, Manh, Nguyen, Dat Quoc

arXiv.org Artificial Intelligence, Aug-8-2022

In this paper, we introduce a high-quality and large-scale benchmark dataset for English-Vietnamese speech translation with 508 audio hours, consisting of 331K triplets of (sentence-length audio, English source transcript sentence, Vietnamese target subtitle sentence). We also conduct empirical experiments using strong baselines and find that the traditional "Cascaded" approach still outperforms the modern "End-to-End" approach. To the best of our knowledge, this is the first large-scale English-Vietnamese speech translation study. We hope both our publicly available dataset and study can serve as a starting point for future research and applications on English-Vietnamese speech translation. Our dataset is available at https://github.com/VinAIResearch/PhoST

cascaded, dataset, translation, (14 more...)

arXiv: 2208.04243

Country:

North America > United States (0.04)
Asia > Vietnam > Hanoi > Hanoi (0.04)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Study of Encoder-Decoder Architectures for Code-Mix Search Query Translation

Kulkarni, Mandar, Chennabasavaraj, Soumya, Garera, Nikesh

arXiv.org Artificial Intelligence, Aug-7-2022

With the broad reach of the internet and smartphones, e-commerce platforms have an increasingly diversified user base. Since native language users are not conversant in English, their preferred browsing mode is their regional language or a combination of their regional language and English. From our recent study on the query data, we noticed that many of the queries we receive are code-mix, specifically Hinglish, i.e., queries with one or more Hindi words written in English (Latin) script. We propose a transformer-based approach for code-mix query translation to enable users to search with these queries. We demonstrate the effectiveness of pre-trained encoder-decoder models trained on a large corpus of unlabeled English text for this task. Using generic domain translation models, we created a pseudo-labelled dataset for training the model on the search queries and verified the effectiveness of various data augmentation techniques. Further, to reduce the latency of the model, we use knowledge distillation and weight quantization. The effectiveness of the proposed method has been validated through experimental evaluations and A/B testing. The model is currently live on the Flipkart app and website, serving millions of queries.
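The abstract mentions weight quantization as a latency measure but does not say which scheme is used. As an illustration only, here is a minimal sketch of symmetric per-tensor int8 quantization, one common variant. The function names and the toy weights are hypothetical and are not taken from the paper.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q,
    with q an integer in [-127, 127]. Illustrative scheme, not the paper's."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integer codes back to approximate float weights."""
    return [scale * v for v in q]


# Toy weights, invented for the example.
weights = [0.5, -1.27, 0.0, 0.31]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy traded for smaller, faster integer arithmetic.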

data augmentation, query, translation, (12 more...)

arXiv: 2208.03713

Country:

Asia > India (0.05)
North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
North America > United States > District of Columbia > Washington (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.48)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Vernacular Search Query Translation with Unsupervised Domain Adaptation

Kulkarni, Mandar, Garera, Nikesh

arXiv.org Artificial Intelligence, Aug-7-2022

With the democratization of e-commerce platforms, an increasingly diversified user base is opting to shop online. To provide a comfortable and reliable shopping experience, it's important to enable users to interact with the platform in the language of their choice. An accurate query translation is essential for Cross-Lingual Information Retrieval (CLIR) with vernacular queries. Due to internet-scale operations, e-commerce platforms get millions of search queries every day. However, creating a parallel training set to train an in-domain translation model is cumbersome. This paper proposes an unsupervised domain adaptation approach to translate search queries without using any parallel corpus. We use an open-domain translation model (trained on a public corpus) and adapt it to the query data using only the monolingual queries from two languages. In addition, fine-tuning with a small labeled set further improves the result. For demonstration, we show results for Hindi to English query translation and use the mBART-large-50 model as the baseline to improve upon. Experimental results show that, without using any parallel corpus, we obtain more than 20 BLEU points improvement over the baseline, while fine-tuning with a small 50k labeled set provides more than 27 BLEU points improvement over the baseline.

adversarial update, query, translation, (11 more...)

arXiv: 2208.03711

Country:

North America > United States (0.04)
Asia > India (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

What Are Transformer Models In Machine Learning - Big Data Analytics News

#artificialintelligence, Aug-3-2022, 05:07:52 GMT

Machine learning is a data analysis method that automates analytical model building. This branch of artificial intelligence is based on the idea that computer systems can learn from data, identify patterns, and make decisions with little or no human intervention. Intelligent systems are built on machine learning algorithms that learn from historical data or past experience. Machine learning applications include image recognition and speech recognition, which are valuable in industries such as medicine, e-commerce, manufacturing, and education. In this article, you'll learn more about transformer models in machine learning. The transformer is a deep learning model that uses the attention mechanism, applied in natural language processing (NLP), a branch of artificial intelligence (AI) that deals with the interaction between humans and computers using natural language.
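The excerpt names the attention mechanism without describing it. As a rough illustration, the sketch below implements scaled dot-product attention, the form used in the original transformer architecture, in plain Python. The function names and the toy vectors are assumptions made for this example and do not come from the article.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    queries, keys, values are lists of equal-length float vectors;
    keys and values must have the same number of rows.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy input: two identical keys, so the query attends to both equally.
result = attention([[1.0, 0.0]],
                   [[1.0, 1.0], [1.0, 1.0]],
                   [[0.0, 2.0], [4.0, 6.0]])
```

With identical keys the weights are uniform, so `result` is the plain average of the two value vectors. Real transformers add learned projections and multiple heads on top of this core step.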

language translation, transformer model, translation, (12 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Cross-Lingual Knowledge Transfer for Clinical Phenotyping

Papaioannou, Jens-Michalis, Grundmann, Paul, van Aken, Betty, Samaras, Athanasios, Kyparissidis, Ilias, Giannakoulas, George, Gers, Felix, Löser, Alexander

arXiv.org Artificial Intelligence, Aug-3-2022

Clinical phenotyping enables the automatic extraction of clinical conditions from patient records, which can be beneficial to doctors and clinics worldwide. However, current state-of-the-art models are mostly applicable to clinical notes written in English. We therefore investigate cross-lingual knowledge transfer strategies to execute this task for clinics that do not use the English language and have a small amount of in-domain data available. We evaluate these strategies for a Greek and a Spanish clinic leveraging clinical notes from different clinical domains such as cardiology, oncology and the ICU. Our results reveal two strategies that outperform the state-of-the-art: Translation-based methods in combination with domain-specific encoders and cross-lingual encoders plus adapters. We find that these strategies perform especially well for classifying rare phenotypes and we advise on which method to prefer in which situation. Our results show that using multilingual data overall improves clinical phenotyping models and can compensate for data sparseness.

dataset, knowledge transfer, translation, (13 more...)

arXiv: 2208.01912

Country:

Europe > Germany (0.14)
Europe > Greece > Central Macedonia > Thessaloniki (0.05)
North America > United States > Massachusetts (0.04)
(9 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Health Care Technology > Medical Record (1.00)
Health & Medicine > Health Care Providers & Services (0.94)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)

Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient's Perspective

Raithel, Lisa, Thomas, Philippe, Roller, Roland, Sapina, Oliver, Möller, Sebastian, Zweigenbaum, Pierre

arXiv.org Artificial Intelligence, Aug-3-2022

In this work, we present the first corpus for German Adverse Drug Reaction (ADR) detection in patient-generated content. The data consists of 4,169 binary annotated documents from a German patient forum, where users talk about health issues and get advice from medical doctors. As is common in social media data in this domain, the class labels of the corpus are very imbalanced. This and a high topic imbalance make it a very challenging dataset, since often, the same symptom can have several causes and is not always related to a medication intake. We aim to encourage further multi-lingual efforts in the domain of ADR detection and provide preliminary experiments for binary classification using different methods of zero- and few-shot learning based on a multi-lingual model. When fine-tuning XLM-RoBERTa first on English patient forum data and then on the new German data, we achieve an F1-score of 37.52 for the positive class. We make the dataset and models publicly available for the community.
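The abstract reports F1 for the positive class on heavily imbalanced labels. To show why that metric is preferred over accuracy in such a setting, here is a minimal sketch of the computation. The labels below are invented toy data, not the corpus, and the function name is hypothetical.

```python
def positive_class_f1(y_true, y_pred, positive=1):
    """F1 score computed for the positive class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (invented): 10 documents, only 2 of them positive.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

f1 = positive_class_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case accuracy is 0.8 while positive-class F1 is only 0.5, because the many easy negatives inflate accuracy but do not enter the F1 computation.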

computational linguistic, dataset, proceedings, (16 more...)

arXiv: 2208.02031

Country:

Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Europe > Germany > Berlin (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
(7 more...)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.46)
Health & Medicine > Health Care Providers & Services (0.34)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Silo NLP's Participation at WAT2022

Parida, Shantipriya, Panda, Subhadarshi, Grönroos, Stig-Arne, Granroth-Wilding, Mark, Koistinen, Mika

arXiv.org Artificial Intelligence, Aug-2-2022

This paper provides the system description of "Silo NLP's" submission to the Workshop on Asian Translation (WAT2022). We have participated in the Indic Multimodal tasks (English->Hindi, English->Malayalam, and English->Bengali Multimodal Translation). For text-only translation, we trained Transformers from scratch and fine-tuned mBART-50 models. For multimodal translation, we used the same mBART architecture and extracted object tags from the images to use as visual features concatenated with the text sequence. Our submission tops many tasks including English->Hindi multimodal translation (evaluation test), English->Malayalam text-only and multimodal translation (evaluation test), English->Bengali multimodal translation (challenge test), and English->Bengali text-only translation (evaluation test).

machine learning, natural language, translation, (15 more...)

arXiv: 2208.01296

Country:

Europe > Finland > Uusimaa > Helsinki (0.05)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > New York (0.04)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Sports > Tennis (0.32)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling

Silva, Marília Costa Rosendo, Siqueira, Felipe Alves, Tarrega, João Pedro Mantovani, Beinotti, João Vitor Pataca, Nunes, Augusto Sousa, Gardini, Miguel de Mattos, da Silva, Vinícius Adolfo Pereira, da Silva, Nádia Félix Felipe, de Carvalho, André Carlos Ponce de Leon Ferreira

arXiv.org Artificial Intelligence, Aug-2-2022

Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. The initialization can lead to variability depending on the machine learning algorithm. Furthermore, the distortions can be misleading when regarding cluster geometry. Amongst the causes, the presence of outliers and anomalies can be a determining factor. Despite the relevance of initialization and outlier issues for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology since similar procedures have different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, the factorization, and the clustering algorithms that are directly or indirectly related to the reviewed works.

algorithm, computational linguistic, reproducibility and distortion issue, (11 more...)

arXiv: 2208.01712

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
(36 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.47)

Industry: Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(3 more...)

Sockeye 3: Fast Neural Machine Translation with PyTorch

Hieber, Felix, Denkowski, Michael, Domhan, Tobias, Barros, Barbara Darques, Ye, Celina Dong, Niu, Xing, Hoang, Cuong, Tran, Ke, Hsu, Benjamin, Nadejde, Maria, Lakew, Surafel, Mathur, Prashant, Currey, Anna, Federico, Marcello

arXiv.org Artificial Intelligence, Aug-2-2022

Sockeye 3 is the latest version of the Sockeye toolkit for Neural Machine Translation (NMT). Now based on PyTorch, Sockeye 3 provides faster model implementations and more advanced features with a further streamlined codebase. This enables broader experimentation with faster iteration, efficient training of stronger and faster models, and the flexibility to move new ideas quickly from research to production. When running comparable models, Sockeye 3 is up to 126% faster than other PyTorch implementations on GPUs and up to 292% faster on CPUs. Sockeye 3 is open source software released under the Apache 2.0 license.
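Percentages such as "126% faster" are easy to misread. Assuming "X% faster" means throughput is (1 + X/100) times the baseline, which the abstract does not spell out, the short sketch below converts the quoted figures into speedup factors. The helper names and the 60-second baseline are hypothetical.

```python
def speedup_factor(percent_faster):
    """Throughput ratio implied by 'percent_faster', under the assumption
    that X% faster means (1 + X/100) times the baseline throughput."""
    return 1 + percent_faster / 100


def time_for_same_work(baseline_seconds, percent_faster):
    """Time needed for work that takes baseline_seconds on the baseline."""
    return baseline_seconds / speedup_factor(percent_faster)


gpu_factor = speedup_factor(126)  # figure quoted for GPUs
cpu_factor = speedup_factor(292)  # figure quoted for CPUs

# Hypothetical job that takes 60 seconds on the baseline implementation.
gpu_seconds = time_for_same_work(60.0, 126)
```

Under that reading, the quoted upper bounds correspond to roughly 2.26 times the baseline throughput on GPUs and 3.92 times on CPUs.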

computational linguistic, proceedings, translation, (11 more...)

arXiv: 2207.05851

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Europe > Germany > Berlin (0.04)
(7 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

On the Pitfalls of Analyzing Individual Neurons in Language Models

Antverg, Omer, Belinkov, Yonatan

arXiv.org Artificial Intelligence, Aug-1-2022

While many studies have shown that linguistic information is encoded in hidden word representations, few have studied individual neurons, to show how and in which neurons it is encoded. Among these, the common approach is to use an external probe to rank neurons according to their relevance to some linguistic attribute, and to evaluate the obtained ranking using the same probe that produced it. We show two pitfalls in this methodology: 1. We separate them and draw conclusions on each. We show that these are not the same. We compare two recent ranking methods and a simple one we introduce, and evaluate them with regard to both of these aspects. Many studies attempt to interpret language models by predicting different linguistic properties from word representations, an approach called probing classifiers (Adi et al., 2017; Conneau et al., 2018, inter alia). A growing body of work focuses on individual neurons within the representation, attempting to show in which neurons some information is encoded, and whether it is localized (concentrated in a small set of neurons) or dispersed. Such knowledge may allow us to control the model's output (Bau et al., 2019), to reduce the number of parameters in the model (Voita et al., 2019; Sajjad et al., 2020), and to gain a general scientific knowledge of the model. The common methodology is to train a probe to predict some linguistic attribute from a representation, and to use it, in different ways, to rank the neurons of the representation according to their importance for the attribute in question.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv: 2110.07483

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Israel (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(9 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
