Machine Translation
Dong
While parallel corpora are an indispensable resource for data-driven multilingual natural language processing tasks such as machine translation, they are limited in quantity, quality and coverage. As a result, learning translation models from non-parallel corpora has become increasingly important, especially for low-resource languages. In this work, we propose a joint model for iteratively learning parallel lexicons and phrases from non-parallel corpora. The model is trained using a Viterbi EM algorithm that alternates between constructing parallel phrases using lexicons and updating lexicons based on the constructed parallel phrases. Experiments on Chinese-English datasets show that our approach learns better parallel lexicons and phrases and improves translation performance significantly.
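The abstract does not spell out the model, so the following is only a toy sketch of the alternation it describes, not the authors' method. It assumes a hard (Viterbi) phrase step that keeps the single best target phrase for each source phrase, and a lexicon step that recounts word links from the chosen pairs. The scoring rule, the smoothing constant `ALPHA`, the acceptance threshold, and the French-English toy data are all illustrative assumptions.

```python
from collections import Counter

ALPHA = 0.1  # smoothing constant (assumed value, not from the paper)

def prob(f, e, counts, vocab_size):
    """Smoothed p(f | e); uniform when the target word e has no links yet."""
    total = sum(c for (e2, _), c in counts.items() if e2 == e)
    if total == 0:
        return 1.0 / vocab_size
    return (counts.get((e, f), 0) + ALPHA) / (total + ALPHA * vocab_size)

def best_links(src, tgt, counts, vocab_size):
    """Link each source word to its most probable target word; return (score, links)."""
    score, links = 1.0, []
    for f in src:
        e = max(tgt, key=lambda cand: prob(f, cand, counts, vocab_size))
        score *= prob(f, e, counts, vocab_size)
        links.append((e, f))
    return score, links

def viterbi_em(src_phrases, tgt_phrases, seed_counts, iterations=2):
    vocab_size = len({f for s in src_phrases for f in s})
    counts, pairs = Counter(seed_counts), {}
    for _ in range(iterations):
        new_counts, pairs = Counter(seed_counts), {}
        for s in src_phrases:
            # Phrase step: keep only the single best-scoring target phrase.
            score, links, t = max(
                (best_links(s, t, counts, vocab_size) + (t,) for t in tgt_phrases),
                key=lambda item: item[0],
            )
            # Skip pairs that are no better than the uniform (no-evidence) baseline.
            if score > (1.0 / vocab_size) ** len(s) * (1 + 1e-9):
                pairs[s] = t
                new_counts.update(links)
        counts = new_counts  # Lexicon step: re-estimate from the chosen pairs.
    return counts, pairs

# Toy non-parallel data with a one-entry seed lexicon (all values made up).
SRC = [("livre", "rouge"), ("voiture", "rouge")]
TGT = [("red", "book"), ("red", "car"), ("blue", "sky")]
SEED = {("book", "livre"): 1}

lexicon_counts, phrase_pairs = viterbi_em(SRC, TGT, SEED, iterations=2)
```

In this toy run, the first iteration can only pair the phrase that contains the seed word, which adds a link for "rouge"; the second iteration then uses that new link to pair the remaining phrase and pick up a link for "voiture". That is the sense in which phrase construction and lexicon updates feed each other.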
Huang
Computer-aided translation (CAT) systems are the most popular tools for helping human translators translate efficiently. To further improve efficiency, there is increasing interest in applying machine translation (MT) technology to upgrade CAT. Post-editing is a standard approach: human translators generate the translation by correcting MT outputs. In this paper, we propose a novel approach that deeply integrates MT into CAT systems: a well-designed input method which makes full use of the knowledge adopted by MT systems, such as translation rules, decoding hypotheses and n-best translation lists. Our proposed approach allows human translators to focus on choosing better translation results in less time rather than completing the translation themselves. Extensive experiments demonstrate that our method saves more than 14% of the time and over 33% of the keystrokes, and also improves translation quality by more than 3 absolute BLEU points compared with a strong baseline, i.e., post-editing using Google Pinyin.
Lee
We present the first automatic emotion detection system for Cantonese. This system classifies input text into eight emotion classes: expectancy, joy, love, surprise, anxiety, sorrow, anger, or hate. While a number of emotion corpora and lexica for Mandarin Chinese have been developed, no emotion dataset is available for Cantonese. We leverage existing Mandarin Chinese emotion resources to build the system, with support from Cantonese-Mandarin lexical mappings from a machine translation system, as well as English-Mandarin lexical mappings to handle code-switching in Cantonese input. Evaluation on a set of Cantonese sentences from social media shows promising results.
Alkhatib
The transliteration of named entities from one language into another is complicated and is considered one of the challenging tasks in machine translation (MT). To build a well-performing transliteration system, we apply well-established techniques based on hybrid deep learning. The system is based on a convolutional neural network (CNN) followed by a Bi-LSTM and a CRF. The proposed hybrid mechanism is evaluated on the ANERCorp and Kalimat corpora. The results show that the neural machine translation approach can be employed to build efficient machine transliteration systems that achieve state-of-the-art results for the Arabic-English language pair.
Ahmadnia
Neural Machine Translation (NMT) relies heavily on word embeddings, which are continuous representations of words in a vector space, obtained from large monolingual data and, independently, from bilingual data for NMT model training. Word embeddings have proven to be invaluable for performance improvements in natural language analysis tasks that otherwise suffer from data scarcity. This paper defines a new cost function---demonstrated on Farsi-Spanish low-resource attention-based NMT---that encodes word similarity as distances within a word embedding space. The novelty of this cost function is that it encourages our attentional NMT model to generate words that are close to their references in the embedding space. This approach encourages the decoder to select acceptably similar words when potential candidates are found to be Out-Of-Vocabulary (OOV).
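The abstract does not give the cost function itself. As a rough illustration of the idea of penalising predictions by their distance from the reference in embedding space, the sketch below computes the expected cosine distance between the words a model predicts and the reference word. The function names, the choice of cosine distance, and the toy 2-d embeddings are assumptions for illustration, not the paper's formulation.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def embedding_distance_cost(probs, embeddings, reference):
    """Expected cosine distance between predicted words and the reference word.

    probs maps each candidate word to its predicted probability.
    """
    ref_vec = embeddings[reference]
    return sum(p * cosine_distance(embeddings[w], ref_vec) for w, p in probs.items())

# Toy 2-d embeddings (made-up values): "large" lies close to "big", "tiny" far away.
EMBEDDINGS = {"big": [1.0, 0.0], "large": [0.8, 0.6], "tiny": [-1.0, 0.0]}

# A prediction that favours a near-synonym costs less than one that favours
# a distant word, even though neither puts any mass on the exact reference.
near = embedding_distance_cost({"large": 0.9, "tiny": 0.1}, EMBEDDINGS, "big")
far = embedding_distance_cost({"large": 0.1, "tiny": 0.9}, EMBEDDINGS, "big")
```

Unlike plain cross-entropy, which treats every non-reference word as equally wrong, a term of this kind gives partial credit to acceptably similar words, which matches the behaviour the abstract describes for out-of-vocabulary candidates.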
The Ecological Footprint of Neural Machine Translation Systems
Shterionov, Dimitar, Vanmassenhove, Eva
Over the past decade, deep learning (DL) has led to significant advancements in various fields of artificial intelligence, including machine translation (MT). These advancements would not be possible without the ever-growing volumes of data and the hardware that allows large DL models to be trained efficiently. Due to their large number of computing cores as well as dedicated memory, graphics processing units (GPUs) are a more effective hardware solution for training and inference with DL models than central processing units (CPUs). However, GPUs are very power demanding. The electrical power consumption has economic as well as ecological implications. This chapter focuses on the ecological footprint of neural MT systems. It starts from the power drain during the training of and the inference with neural MT models and moves towards the environmental impact, in terms of carbon dioxide emissions. Different architectures (RNN and Transformer) and different GPUs (consumer-grade NVidia 1080Ti and workstation-grade NVidia P100) are compared. Then, the overall CO2 offload is calculated for Ireland and the Netherlands. The NMT models and their ecological impact are compared to common household appliances to draw a clearer picture. The last part of this chapter analyses quantization, a technique for reducing the size and complexity of models, as a way to reduce power consumption. As quantized models can run on CPUs, they present a power-efficient inference solution without depending on a GPU.
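The step from power drain to carbon dioxide emissions can be sketched as simple arithmetic: energy is average power times duration, and emissions are energy times the carbon intensity of the electricity grid. The figures below are hypothetical placeholders, not measurements from the chapter and not the actual grid intensities of Ireland or the Netherlands.

```python
def energy_kwh(avg_power_watts: float, hours: float) -> float:
    """Electrical energy in kWh for a device drawing avg_power_watts for `hours`."""
    return avg_power_watts * hours / 1000.0

def co2_kg(energy_in_kwh: float, grid_intensity_kg_per_kwh: float) -> float:
    """CO2 in kg for a given energy use and grid carbon intensity."""
    return energy_in_kwh * grid_intensity_kg_per_kwh

# Hypothetical inputs: a GPU averaging 250 W over a 24-hour training run,
# on a grid that emits 0.5 kg of CO2 per kWh (placeholder value).
training_energy = energy_kwh(avg_power_watts=250.0, hours=24.0)
training_emissions = co2_kg(training_energy, grid_intensity_kg_per_kwh=0.5)
```

Because the grid intensity is a separate input, the same measured energy use yields different emissions in different countries, which is why a per-country comparison is informative.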
Formal Mathematics Statement Curriculum Learning
Polu, Stanislas, Han, Jesse Michael, Zheng, Kunhao, Baksys, Mantas, Babuschkin, Igor, Sutskever, Ilya
We explore the use of expert iteration in the context of language modeling applied to formal mathematics. We show that at the same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. Finally, by applying this expert iteration to a manually curated set of problem statements, we achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.
Should AI Be Centered on Machine Learning Algorithms or Data?
Arun Shastri, PhD, leads ZS's global AI strategy practice, which spans research, helping clients build their capabilities and platform solutions. In this role, he also oversees analytics services and solutions for several industry sectors. PKS Prakash, PhD, is a principal at ZS Associates; he designs and implements advanced data science and AI techniques across multiple verticals including healthcare, hospitality, retail and manufacturing.
Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation
Majewska, Olga, Razumovskaia, Evgeniia, Ponti, Edoardo Maria, Vulić, Ivan, Korhonen, Anna
Multilingual task-oriented dialogue (ToD) facilitates access to services and information for many (communities of) speakers. Nevertheless, the potential of this technology is not fully realised, as current datasets for multilingual ToD - both for modular and end-to-end modelling - suffer from severe limitations. 1) When created from scratch, they are usually small in scale and fail to cover many possible dialogue flows. 2) Translation-based ToD datasets might lack naturalness and cultural specificity in the target language. In this work, to tackle these limitations we propose a novel outline-based annotation process for multilingual ToD datasets, where domain-specific abstract schemata of dialogue are mapped into natural language outlines. These in turn guide the target language annotators in writing a dialogue by providing instructions about each turn's intents and slots. Through this process we annotate a new large-scale dataset for training and evaluation of multilingual and cross-lingual ToD systems. Our Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages: Arabic, Indonesian, Russian, and Kiswahili. Qualitative and quantitative analyses of COD versus an equivalent translation-based dataset demonstrate improvements in data quality, unlocked by the outline-based approach. Finally, we benchmark a series of state-of-the-art systems for cross-lingual ToD, setting reference scores for future work and demonstrating that COD prevents over-inflated performance, typically met with prior translation-based ToD datasets.
Use a web browser plugin to quickly translate text with Amazon Translate
Web browsers can be a single pane of glass for organizations to interact with their information--all of the tools can be viewed and accessed on one screen so that users don't have to switch between applications and interfaces. For example, a customer call center might have several different applications to see customer reviews, social media feeds, and customer data. Each of these applications is accessed through a web browser. If the information is in a language that the user doesn't speak, however, a separate application often needs to be pulled up to translate text. Web browser plugins enable customization of this user experience.