Machine Translation
A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation
Adelani, David Ifeoluwa, Alabi, Jesujoba Oluwadara, Fan, Angela, Kreutzer, Julia, Shen, Xiaoyu, Reid, Machel, Ruiter, Dana, Klakow, Dietrich, Nabende, Peter, Chang, Ernie, Gwadabe, Tajuddeen, Sackey, Freshia, Dossou, Bonaventure F. P., Emezue, Chris Chinenye, Leong, Colin, Beukman, Michael, Muhammad, Shamsuddeen Hassan, Jarso, Guyo Dub, Yousuf, Oreen, Rubungo, Andre Niyongabo, Hacheme, Gilles, Wairagala, Eric Peter, Nasir, Muhammad Umair, Ajibade, Benjamin Ayoade, Ajayi, Tunde Oluwaseyi, Gitau, Yvonne Wambui, Abbott, Jade, Ahmed, Mohamed, Ochieng, Millicent, Aremu, Anuoluwapo, Ogayo, Perez, Mukiibi, Jonathan, Kabore, Fatoumata Ouoba, Kalipe, Godson Koffi, Mbaye, Derguene, Tapo, Allahsera Auguste, Koagne, Victoire Memdjokam, Munkoh-Buabeng, Edwin, Wagner, Valencia, Abdulmumin, Idris, Awokoya, Ayodele, Buzaaba, Happy, Sibanda, Blessing, Bukula, Andiswa, Manthalu, Sam
Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a new African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.
Thought Leaders in Artificial Intelligence: Spence Green, CEO of Lilt (Part 1)
This is a terrific conversation about a SaaS-enabled BPO company, Lilt, in the domain of language translation. Sramana Mitra: Let's start by introducing our audience to yourself as well as Lilt. Spence Green: I am the CEO of Lilt. We have two parts of our business. The private sector of our business focuses on creating global customer experiences so that all products and services are available in all languages. We work with enterprises that want to make the user experience in other languages better. Usually, it is as good and personalized as it is in English. We have a public sector business that also works with language. We make it possible for governments to augment the language capabilities that they have primarily for defense and intelligence reasons. These are unified by a common technology that we have built over the past 10 years. This is all done under the mission of making the world's information available irrespective of where you were born or what language you speak.
How Meta Is Making Artificial Intelligence More Inclusive
Artificial intelligence (AI) must be inclusive to reach its potential. AI applications that solve problems for a small segment of the population will fail to achieve widespread adoption. So, it's important that AI applications be designed and prepared with data that reflects as many segments of the global population as possible. Many moving parts need to be managed well to do that, and one of them is language. The more languages an AI application can handle, the more inclusive it is.
Searching for Structure in Unfalsifiable Claims
Christensen, Peter Ebert, Warburg, Frederik, Jia, Menglin, Belongie, Serge
Social media platforms give rise to an abundance of posts and comments on every topic imaginable. Many of these posts express opinions on various aspects of society, but their unfalsifiable nature makes them ill-suited to fact-checking pipelines. In this work, we aim to distill such posts into a small set of narratives that capture the essential claims related to a given topic. Understanding and visualizing these narratives can facilitate more informed debates on social media. As a first step towards systematically identifying the underlying narratives on social media, we introduce PAPYER, a fine-grained dataset of online comments related to hygiene in public restrooms, which contains a multitude of unfalsifiable claims. We present a human-in-the-loop pipeline that uses a combination of machine and human kernels to discover the prevailing narratives and show that this pipeline outperforms recent large transformer models and state-of-the-art unsupervised topic models.
Discourse Cohesion Evaluation for Document-Level Neural Machine Translation
Tan, Xin, Zhang, Longyin, Zhou, Guodong
It is well known that translations generated by an excellent document-level neural machine translation (NMT) model are consistent and coherent. However, existing sentence-level evaluation metrics like BLEU can hardly reflect the model's performance at the document level. To tackle this issue, we propose a Discourse Cohesion Evaluation Method (DCoEM) in this paper and contribute a new test suite that considers four cohesive manners (reference, conjunction, substitution, and lexical cohesion) to measure the cohesiveness of document translations. The evaluation results on recent document-level NMT systems show that our method is practical and essential in estimating translations at the document level.
How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?
Araabi, Ali, Monz, Christof, Niculae, Vlad
Neural Machine Translation (NMT) is an open vocabulary problem. As a result, dealing with words that do not occur during training (a.k.a. out-of-vocabulary (OOV) words) has long been a fundamental challenge for NMT systems. The predominant method to tackle this problem is Byte Pair Encoding (BPE), which splits words, including OOV words, into sub-word segments. BPE has achieved impressive results for a wide range of translation tasks in terms of automatic evaluation metrics. While it is often assumed that by using BPE, NMT systems are capable of handling OOV words, the effectiveness of BPE in translating OOV words has not been explicitly measured. In this paper, we study to what extent BPE is successful in translating OOV words at the word level. We analyze the translation quality of OOV words based on word type, number of segments, cross-attention weights, and the frequency of segment n-grams in the training data. Our experiments show that while careful BPE settings seem to be fairly useful in translating OOV words across datasets, a considerable percentage of OOV words are translated incorrectly. Furthermore, we highlight the slightly higher effectiveness of BPE in translating OOV words for special cases, such as named entities and when the languages involved are linguistically close to each other.
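To make the segmentation step concrete, the following is a minimal sketch of how BPE can split a word that never appeared in training. It is a toy illustration, not the authors' experimental setup: the corpus, the number of merges, the `</w>` end-of-word marker, and the function names `learn_bpe`, `merge_pair`, and `segment` are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: count} dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself to stay deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Segment any word, seen or unseen, by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy training corpus; "lowest" is out-of-vocabulary.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=10)
oov_segments = segment("lowest", merges)
print(oov_segments)
```

The OOV word is always representable because segmentation falls back to single characters, but whether the resulting segments are translated correctly is a separate question, which is the one the abstract examines.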
A Bidirectional Tree Tagging Scheme for Joint Medical Relation Extraction
Luo, Xukun, Liu, Weijie, Ma, Meng, Wang, Ping
Joint medical relation extraction refers to extracting triples, composed of entities and relations, from medical text with a single model. One solution is to convert this task into a sequence tagging task. However, in existing work, methods that represent and tag the triples linearly fail to handle overlapping triples, and methods that organize the triples as a graph incur a large computational cost. In this paper, inspired by the tree-like relation structures in medical text, we propose a novel scheme called Bidirectional Tree Tagging (BiTT) that forms the medical relation triples into two binary trees and converts the trees into a word-level tag sequence. Based on the BiTT scheme, we develop a joint relation extraction model to predict the BiTT tags and then extract medical triples efficiently. Our model outperforms the best baselines by 2.0% and 2.5% in F1 score on two medical datasets. Moreover, models using our BiTT scheme also obtain promising results on three public datasets from other domains.
Reproduction and Replication of an Adversarial Stylometry Experiment
Wang, Haining, Juola, Patrick, Riddell, Allen
Maintaining anonymity while communicating using natural language remains a challenge. Standard authorship attribution techniques that analyze candidate authors' writing styles achieve uncomfortably high accuracy even when the number of candidate authors is high. Adversarial stylometry defends against authorship attribution with the goal of preventing unwanted deanonymization. This paper reproduces and replicates experiments in a seminal study of defenses against authorship attribution (Brennan et al., 2012). We are able to successfully reproduce and replicate the original results, although we conclude that the effectiveness of the defenses studied is overstated due to a lack of a control group in the original study. In our replication, we find new evidence suggesting that an entirely automatic method, round-trip translation, merits re-examination as it appears to reduce the effectiveness of established authorship attribution methods.
Fast Vocabulary Projection Method via Clustering for Multilingual Machine Translation on GPU
Amer, Hossam, Kim, Young Jin, Afify, Mohamed, Matsushita, Hitokazu, Awadallah, Hany Hassan
Multilingual Neural Machine Translation has been showing great success using transformer models. Deploying these models is challenging because they usually require large vocabulary (vocab) sizes for various languages. This limits the speed of predicting the output tokens in the last vocab projection layer. To alleviate these challenges, this paper proposes a fast vocabulary projection method via clustering which can be used for multilingual transformers on GPUs. First, we split the vocab search space offline into disjoint clusters given the hidden context vector of the decoder output, which results in much smaller vocab columns for vocab projection. Second, at inference time, the proposed method predicts the clusters and candidate active tokens for hidden context vectors at the vocab projection. This paper also includes an analysis of different ways of building these clusters in multilingual settings. Our results show end-to-end speed gains in float16 GPU inference of up to 25% while maintaining the BLEU score and slightly increasing memory cost. The proposed method speeds up the vocab projection step itself by up to 2.6x. We also conduct an extensive human evaluation to verify that the proposed method preserves the quality of the translations from the original model.
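The two-stage idea (cluster offline, then score only a cluster's candidate tokens at inference) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: it clusters sample hidden vectors with a small k-means, records each cluster's top-scoring tokens as its active set (so the sets here may overlap rather than being disjoint), and uses random data. All names (`kmeans`, `build_clusters`, `fast_argmax`, `W`, `hiddens`) are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(h, centroids):
    """Index of the centroid closest to hidden vector h."""
    return min(range(len(centroids)), key=lambda c: sq_dist(h, centroids[c]))


def kmeans(points, k, iters=10, seed=0):
    """Tiny k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def full_argmax(h, W):
    """Baseline: score every vocabulary row and return the best token id."""
    return max(range(len(W)), key=lambda t: dot(h, W[t]))


def build_clusters(hiddens, W, k, top_n=1):
    """Offline step: cluster hidden vectors and collect active tokens per cluster."""
    centroids = kmeans(hiddens, k)
    active = [set() for _ in range(k)]
    for h in hiddens:
        c = nearest(h, centroids)
        ranked = sorted(range(len(W)), key=lambda t: dot(h, W[t]), reverse=True)
        active[c].update(ranked[:top_n])
    return centroids, [sorted(s) for s in active]


def fast_argmax(h, W, centroids, active):
    """Inference step: pick a cluster, then score only its candidate tokens."""
    candidates = active[nearest(h, centroids)] or range(len(W))  # fall back to full vocab
    return max(candidates, key=lambda t: dot(h, W[t]))


# Hypothetical data: 50-token vocabulary, 8-dimensional hidden states.
data_rng = random.Random(1)
W = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
hiddens = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
centroids, active = build_clusters(hiddens, W, k=4)
print([len(a) for a in active])  # candidate-set sizes, each at most the vocabulary size
```

On the vectors used to build the clusters, the reduced search returns the same token as the full projection by construction; on unseen vectors it is only an approximation, which is why the abstract reports BLEU and human evaluation alongside the speed gains.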
Root-aligned SMILES: A Tight Representation for Chemical Reaction Prediction
Zhong, Zipeng, Song, Jie, Feng, Zunlei, Liu, Tiantao, Jia, Lingxiang, Yao, Shaolun, Wu, Min, Hou, Tingjun, Song, Mingli
Chemical reaction prediction, involving forward synthesis and retrosynthesis prediction, is a fundamental problem in organic synthesis. A popular computational paradigm formulates synthesis prediction as a sequence-to-sequence translation problem, where the typical SMILES is adopted for molecule representations. However, the general-purpose SMILES neglects the characteristics of chemical reactions, where the molecular graph topology is largely unaltered from reactants to products, resulting in the suboptimal performance of SMILES if straightforwardly applied. In this article, we propose the root-aligned SMILES (R-SMILES), which specifies a tightly aligned one-to-one mapping between the product and the reactant SMILES for more efficient synthesis prediction. Due to the strict one-to-one mapping and reduced edit distance, the computational model is largely relieved from learning the complex syntax and dedicated to learning the chemical knowledge for reactions. We compare the proposed R-SMILES with various state-of-the-art baselines and show that it significantly outperforms them all, demonstrating the superiority of the proposed method.
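The link between alignment and reduced edit distance can be illustrated with a small sketch. The SMILES strings below are hand-written for an esterification example (ethyl acetate from acetic acid and ethanol) and are assumptions chosen for illustration; they are not produced by the authors' root-alignment procedure, and `levenshtein` is a generic helper rather than part of their method.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical strings: the same reactants written in two different orders.
product = "CC(=O)OCC"        # ethyl acetate
aligned = "CC(=O)O.OCC"      # reactants written to follow the product's atom order
unaligned = "CCO.CC(=O)O"    # same reactants, different order and starting atoms

print(levenshtein(product, aligned), levenshtein(product, unaligned))
```

When the reactant string follows the product's atom order, most of the output can be copied from the input, so a sequence-to-sequence model has fewer edits to learn.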