Collaborating Authors

target language

Deep learning to translate between programming languages


Migrating a codebase from an archaic programming language such as COBOL to a modern alternative like Java or C is a difficult, resource-intensive task that requires expertise in both the source and target languages. COBOL, for example, is still widely used today in mainframe systems around the world, so companies, governments, and others often must choose whether to manually translate their code bases or commit to maintaining code written in a language that dates back to the 1950s. We've developed TransCoder, an entirely self-supervised neural transcompiler system that can make code migration far easier and more efficient. Our method is the first AI system able to translate code from one programming language to another without requiring parallel data for training. We've demonstrated that TransCoder can successfully translate functions between C, Java, and Python 3. TransCoder outperforms open source and commercial rule-based translation programs.

Java To Python And Back, AI That Translates Programming Languages


The Commonwealth Bank of Australia spent around $750 million and 5 years of work to convert its platform from COBOL to Java. Migrating an existing codebase to a modern or more efficient language like Java or C requires expertise in both the source and target languages, and is often costly. Usually, a transcompiler is deployed that converts source code from a high-level programming language (such as C or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree.

Convolutional Neural Networks and a Transfer Learning Strategy to Classify Parkinson's Disease from Speech in Three Different Languages Machine Learning

Parkinson's disease patients develop different speech impairments that affect their communication capabilities. The automatic assessment of the speech of the patients allows the development of computer aided tools to support the diagnosis and the evaluation of the disease severity. This paper introduces a methodology to classify Parkinson's disease from speech in three different languages: Spanish, German, and Czech. The proposed approach considers convolutional neural networks trained with time frequency representations and a transfer learning strategy among the three languages. The transfer learning scheme aims to improve the accuracy of the models when the weights of the neural network are initialized with utterances from a different language than the used for the test set. The results suggest that the proposed strategy improves the accuracy of the models in up to 8\% when the base model used to initialize the weights of the classifier is robust enough. In addition, the results obtained after the transfer learning are in most cases more balanced in terms of specificity-sensitivity than those trained without the transfer learning strategy.

Neural Machine Translation System of Indic Languages -- An Attention based Approach Machine Learning

Neural machine translation (NMT) is a recent and effective technique which led to remarkable improvements in comparison of conventional machine translation techniques. Proposed neural machine translation model developed for the Gujarati language contains encoder-decoder with attention mechanism. In India, almost all the languages are originated from their ancestral language - Sanskrit. They are having inevitable similarities including lexical and named entity similarity. Translating into Indic languages is always be a challenging task. In this paper, we have presented the neural machine translation system (NMT) that can efficiently translate Indic languages like Hindi and Gujarati that together covers more than 58.49 percentage of total speakers in the country. We have compared the performance of our NMT model with automatic evaluation matrices such as BLEU, perplexity and TER matrix. The comparison of our network with Google translate is also presented where it outperformed with a margin of 6 BLEU score on English-Gujarati translation.

A Comprehensive Survey of Multilingual Neural Machine Translation Artificial Intelligence

We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in the recent years. MNMT has been useful in improving translation quality as a result of translation knowledge transfer (transfer learning). MNMT is more promising and interesting than its statistical machine translation counterpart because end-to-end modeling and distributed representations open new avenues for research on machine translation. Many approaches have been proposed in order to exploit multilingual parallel corpora for improving translation quality. However, the lack of a comprehensive survey makes it difficult to determine which approaches are promising and hence deserve further exploration. In this paper, we present an in-depth survey of existing literature on MNMT. We first categorize various approaches based on their central use-case and then further categorize them based on resource scenarios, underlying modeling principles, core-issues and challenges. Wherever possible we address the strengths and weaknesses of several techniques by comparing them with each other. We also discuss the future directions that MNMT research might take. This paper is aimed towards both, beginners and experts in NMT. We hope this paper will serve as a starting point as well as a source of new ideas for researchers and engineers interested in MNMT.

Cross-Lingual Ability of Multilingual BERT: An Empirical Study Artificial Intelligence

Recent work has exhibited the surprising cross-lingual abilities of multilingual BERT (M-BERT) -- surprising since it is trained without any cross-lingual objective and with no aligned data. In this work, we provide a comprehensive study of the contribution of different components in M-BERT to its cross-lingual ability. We study the impact of linguistic properties of the languages, the architecture of the model, and the learning objectives. The experimental study is done in the context of three typologically different languages -- Spanish, Hindi, and Russian -- and using two conceptually different NLP tasks, textual entailment and named entity recognition. Among our key conclusions is the fact that the lexical overlap between languages plays a negligible role in the cross-lingual success, while the depth of the network is an integral part of it.

Two Way Adversarial Unsupervised Word Translation Machine Learning

Word translation is a problem in machine translation that seeks to build models that recover word level correspondence between languages. Recent approaches to this problem have shown that word translation models can learned with very small seeding dictionaries, and even without any starting supervision. In this paper we propose a method to jointly find translations between a pair of languages. Not only does our method learn translations in both directions but it improves accuracy of those translations over past methods.

Towards Lingua Franca Named Entity Recognition with BERT Machine Learning

Information extraction is an important task in NLP, enabling the automatic extraction of data for relational database filling. Historically, research and data was produced for English text, followed in subsequent years by datasets in Arabic, Chinese (ACE/OntoNotes), Dutch, Spanish, German (CoNLL evaluations), and many others. The natural tendency has been to treat each language as a different dataset and build optimized models for each. In this paper we investigate a single Named Entity Recognition model, based on a multilingual BERT, that is trained jointly on many languages simultaneously, and is able to decode these languages with better accuracy than models trained only on one language. To improve the initial model, we study the use of regularization strategies such as multitask learning and partial gradient updates. In addition to being a single model that can tackle multiple languages (including code switch), the model could be used to make zero-shot predictions on a new language, even ones for which training data is not available, out of the box. The results show that this model not only performs competitively with monolingual models, but it also achieves state-of-the-art results on the CoNLL02 Dutch and Spanish datasets, OntoNotes Arabic and Chinese datasets. Moreover, it performs reasonably well on unseen languages, achieving state-of-the-art for zero-shot on three CoNLL languages.

Re-Translation Strategies For Long Form, Simultaneous, Spoken Language Translation Artificial Intelligence

We investigate the problem of simultaneous machine translation of long-form speech content. We target a continuous speech-to-text scenario, generating translated captions for a live audio feed, such as a lecture or play-by-play commentary. As this scenario allows for revisions to our incremental translations, we adopt a re-translation approach to simultaneous translation, where the source is repeatedly translated from scratch as it grows. This approach naturally exhibits very low latency and high final quality, but at the cost of incremental instability as the output is continuously refined. We experiment with a pipeline of industry-grade speech recognition and translation tools, augmented with simple inference heuristics to improve stability. We use TED Talks as a source of multilingual test data, developing our techniques on English-to-German spoken language translation. Our minimalist approach to simultaneous translation allows us to easily scale our final evaluation to six more target languages, dramatically improving incremental stability for all of them.

A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings: Making the Method Robustly Reproducible as Well Machine Learning

In this paper, we reproduce the experiments of Artetxe et al. (2018b) regarding the robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. We show that the reproduction of their method is indeed feasible with some minor assumptions. We further investigate the robustness of their model by introducing four new languages that are less similar to English than the ones proposed by the original paper. In order to assess the stability of their model, we also conduct a grid search over sensible hyperparameters. We then propose key recommendations applicable to any research project in order to deliver fully reproducible research.