Machine Translation
The Ecological Footprint of Neural Machine Translation Systems
Shterionov, Dimitar, Vanmassenhove, Eva
Over the past decade, deep learning (DL) has led to significant advancements in various fields of artificial intelligence, including machine translation (MT). These advancements would not be possible without the ever-growing volumes of data and the hardware that allows large DL models to be trained efficiently. Owing to their large number of computing cores and their dedicated memory, graphics processing units (GPUs) are a more effective hardware solution for training and inference with DL models than central processing units (CPUs). However, GPUs are very power-demanding, and their electrical power consumption has economic as well as ecological implications. This chapter focuses on the ecological footprint of neural MT (NMT) systems. It starts from the power drain during the training of and inference with NMT models and moves towards the environmental impact in terms of carbon dioxide (CO2) emissions. Different architectures (RNN and Transformer) and different GPUs (consumer-grade NVIDIA 1080Ti and workstation-grade NVIDIA P100) are compared. Then, the overall CO2 offload is calculated for Ireland and the Netherlands. The NMT models and their ecological impact are compared to common household appliances to draw a clearer picture. The last part of this chapter analyses quantization, a technique for reducing the size and complexity of models, as a way to reduce power consumption. As quantized models can run on CPUs, they present a power-efficient inference solution without depending on a GPU.
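The step from power drain to carbon dioxide emissions described in this abstract comes down to simple arithmetic: energy is average power multiplied by time, and emissions are energy multiplied by the carbon intensity of the local electricity grid. The sketch below illustrates that arithmetic only. The function names and all numeric values are placeholders chosen for illustration; they are not measurements or results from the chapter.

```python
def energy_kwh(avg_power_watts: float, hours: float) -> float:
    """Energy in kilowatt-hours from average power draw (W) and duration (h)."""
    return avg_power_watts * hours / 1000.0


def co2_kg(energy_kwh_used: float, grid_intensity_kg_per_kwh: float) -> float:
    """CO2 in kilograms, given energy used and grid carbon intensity (kg CO2 per kWh)."""
    return energy_kwh_used * grid_intensity_kg_per_kwh


if __name__ == "__main__":
    # Placeholder inputs (assumptions, not figures from the chapter):
    # a GPU drawing 250 W on average for 24 hours, on a grid emitting 0.5 kg CO2 per kWh.
    used = energy_kwh(avg_power_watts=250.0, hours=24.0)
    print(f"energy: {used} kWh, emissions: {co2_kg(used, 0.5)} kg CO2")
```

Because grid carbon intensity differs between countries, the same training run yields different emission estimates in different locations, which is presumably why the chapter reports figures per country.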
Formal Mathematics Statement Curriculum Learning
Polu, Stanislas, Han, Jesse Michael, Zheng, Kunhao, Baksys, Mantas, Babuschkin, Igor, Sutskever, Ilya
We explore the use of expert iteration in the context of language modeling applied to formal mathematics. We show that, at the same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. Finally, by applying this expert iteration to a manually curated set of problem statements, we achieve state-of-the-art results on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.
Should AI Be Centered on Machine Learning Algorithms or Data?
Arun Shastri, PhD, leads ZS's global AI strategy practice, which spans research, helping clients build their capabilities, and platform solutions. In this role, he also oversees analytics services and solutions for several industry sectors. PKS Prakash, PhD, is a principal at ZS Associates; he designs and implements advanced data science and AI techniques across multiple verticals, including healthcare, hospitality, retail and manufacturing.
Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation
Majewska, Olga, Razumovskaia, Evgeniia, Ponti, Edoardo Maria, Vulić, Ivan, Korhonen, Anna
Multilingual task-oriented dialogue (ToD) facilitates access to services and information for many (communities of) speakers. Nevertheless, the potential of this technology is not fully realised, as current datasets for multilingual ToD - both for modular and end-to-end modelling - suffer from severe limitations. 1) When created from scratch, they are usually small in scale and fail to cover many possible dialogue flows. 2) Translation-based ToD datasets might lack naturalness and cultural specificity in the target language. In this work, to tackle these limitations we propose a novel outline-based annotation process for multilingual ToD datasets, where domain-specific abstract schemata of dialogue are mapped into natural language outlines. These in turn guide the target language annotators in writing a dialogue by providing instructions about each turn's intents and slots. Through this process we annotate a new large-scale dataset for training and evaluation of multilingual and cross-lingual ToD systems. Our Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages: Arabic, Indonesian, Russian, and Kiswahili. Qualitative and quantitative analyses of COD versus an equivalent translation-based dataset demonstrate improvements in data quality, unlocked by the outline-based approach. Finally, we benchmark a series of state-of-the-art systems for cross-lingual ToD, setting reference scores for future work and demonstrating that COD prevents over-inflated performance, typically met with prior translation-based ToD datasets.
Use a web browser plugin to quickly translate text with Amazon Translate
Web browsers can be a single pane of glass for organizations to interact with their information: all of the tools can be viewed and accessed on one screen, so that users don't have to switch between applications and interfaces. For example, a customer call center might have several different applications to see customer reviews, social media feeds, and customer data. Each of these applications is accessed through a web browser. If the information is in a language that the user doesn't speak, however, a separate application often needs to be opened to translate the text. Web browser plugins enable customization of this user experience.
Best Machine Language Translators
Machine language translators have improved a lot over the years. They have become easier to use and produce accurate translations at low or no cost. Machine translation services and software have been a boon for localization. Neural machine translation algorithms make the resulting translations read naturally. Let's take a look at the best machine translation engines in 2022.
New voices in AI: David Adelani
Welcome to the first episode of New voices in AI! You can find David on Twitter @davlanade and find out more about Masakhane here. The music used is 'Wholesome' by Kevin MacLeod, licensed under Creative Commons. Daly: Hello and welcome to New voices in AI. This is a new series from AIhub where we celebrate the voices of PhD students, early career researchers, and those with a new perspective on AI. And without further ado, let's begin. First up, a big welcome to our very first guest on "New voices in AI", and if you could introduce yourself, who are you? Adelani: Thank you very much for having me. So, Masakhane is this grassroots organization whose mission is to strengthen and spur NLP research in African languages, by Africans for Africans. Currently, the organization operates mainly on Slack, and we already have over 1000 members. Of course, not everyone is active, but we have more than 100 or close to 100 active members as well, yeah. So how did you get into AI?
"Artificial Intelligence" Science-Research, January 2022, Week 3 -- summary from Europe PMC
Background: The liver is one of the most common metastatic sites of colon cancer, and liver metastasis (LM) determines subsequent therapy and patient prognosis, particularly in T1 patients. There is still no effective model to predict the risk of LM in T1 colorectal cancer (CRC) patients. Objectives: Chest radiographs are commonly performed in emergency units, yet their interpretation requires radiology expertise. High-quality English-Chinese parallel corpora are currently in short supply. The multilingual dictionary derived from the translation model is then combined with the language model to initialize an unsupervised translation model, and the unsupervised English-Chinese neural machine translation model is optimized with back-translation.
An Empirical Study on the Overlapping Problem of Open-Domain Dialogue Datasets
Wen, Yuqiao, Luo, Guoqing, Mou, Lili
Open-domain dialogue systems aim to converse with humans through text, and research on them has relied heavily on benchmark datasets. In this work, we first identify the overlapping problem in DailyDialog and OpenSubtitles, two popular open-domain dialogue benchmark datasets. Our systematic analysis then shows that such overlapping can be exploited to obtain fake state-of-the-art performance. Finally, we address this issue by cleaning these datasets and setting up a proper data processing procedure for future research.
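One way to picture the overlapping problem is as test samples that also appear in the training data. The sketch below rests on that assumption and uses exact matching after light normalization, which is an illustrative simplification and not the authors' actual procedure. The helper names `normalize`, `overlap_ratio` and `remove_overlap` are hypothetical.

```python
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def overlap_ratio(train: Iterable[str], test: List[str]) -> float:
    """Fraction of test samples whose normalized form also occurs in train."""
    if not test:
        return 0.0
    seen = {normalize(t) for t in train}
    hits = sum(1 for t in test if normalize(t) in seen)
    return hits / len(test)


def remove_overlap(train: Iterable[str], test: Iterable[str]) -> List[str]:
    """Return the training samples that do not also occur in the test set."""
    held_out = {normalize(t) for t in test}
    return [t for t in train if normalize(t) not in held_out]


if __name__ == "__main__":
    # Toy data invented for illustration only.
    train = ["Hello there!", "How are you?", "Fine, thanks."]
    test = ["hello   there!", "See you."]
    print(overlap_ratio(train, test))
    print(remove_overlap(train, test))
```

A model can score well on overlapping test samples simply by memorizing them, so measuring this ratio before training is a cheap sanity check. Real dialogue data would likely need matching at the level of context-response pairs, not single strings.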
Cost-Effective Training in Low-Resource Neural Machine Translation
Koneru, Sai, Liu, Danni, Niehues, Jan
While Active Learning (AL) techniques are explored in Neural Machine Translation (NMT), only a few works focus on tackling low annotation budgets where a limited number of sentences can get translated. Such situations are especially challenging and can occur for endangered languages with few human annotators, or when cost constraints prevent labelling large amounts of data. Although AL is shown to be helpful with large budgets, it is not enough to build high-quality translation systems in these low-resource conditions. In this work, we propose a cost-effective training procedure to increase the performance of NMT models utilizing a small number of annotated sentences and dictionary entries. Our method leverages monolingual data with self-supervised objectives and a small-scale, inexpensive dictionary for additional supervision to initialize the NMT model before applying AL. We show that improving the model using a combination of these knowledge sources is essential to exploit AL strategies and increase gains in low-resource conditions. We also present a novel AL strategy inspired by domain adaptation for NMT and show that it is effective for low budgets. We propose a new hybrid data-driven approach, which samples sentences that differ from the labelled data and are also most similar to the unlabelled data. Finally, we show that initializing the NMT model and further using our AL strategy can achieve gains of up to 13 BLEU compared to conventional AL methods.
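The hybrid sampling idea in this abstract, preferring sentences that differ from the labelled data while resembling the unlabelled pool, can be sketched with a simple scoring function. The abstract does not say how similarity is measured, so the sketch below substitutes word-set Jaccard similarity as a stand-in. The function names and the scoring formula (mean similarity to the unlabelled pool minus maximum similarity to the labelled set) are assumptions for illustration, not the authors' method.

```python
from typing import List, Set


def tokens(sentence: str) -> Set[str]:
    """Lowercased whitespace tokens as a set (a crude stand-in for real features)."""
    return set(sentence.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_score(sentence: str, labelled: List[str], unlabelled: List[str]) -> float:
    """Higher when the sentence is representative of the pool and unlike labelled data."""
    s = tokens(sentence)
    representativeness = (
        sum(jaccard(s, tokens(u)) for u in unlabelled) / len(unlabelled)
        if unlabelled else 0.0
    )
    redundancy = max((jaccard(s, tokens(l)) for l in labelled), default=0.0)
    return representativeness - redundancy


def select(labelled: List[str], unlabelled: List[str], budget: int) -> List[str]:
    """Pick the `budget` highest-scoring sentences from the unlabelled pool."""
    ranked = sorted(
        unlabelled,
        key=lambda u: hybrid_score(u, labelled, unlabelled),
        reverse=True,
    )
    return ranked[:budget]


if __name__ == "__main__":
    # Toy sentences invented for illustration only.
    labelled = ["the cat sat"]
    pool = ["the cat sat", "a dog ran fast", "a dog ran home"]
    print(select(labelled, pool, budget=2))
```

In this toy pool the sentence already covered by the labelled set ranks last, so a budget of two goes to the two sentences that are new and typical of the pool. A practical system would likely use sentence embeddings or n-gram statistics in place of token-set overlap.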