Machine Translation


UM4: Unified Multilingual Multiple Teacher-Student Model for Zero-Resource Neural Machine Translation

arXiv.org Artificial Intelligence

Most translation tasks between languages are zero-resource translation problems, where parallel corpora are unavailable. Multilingual neural machine translation (MNMT) enables one-pass translation using a shared semantic space for all languages, in contrast to two-pass pivot translation, but it often underperforms the pivot-based method. In this paper, we propose a novel method named the Unified Multilingual Multiple teacher-student Model for NMT (UM4). Our method unifies source-teacher, target-teacher, and pivot-teacher models to guide the student model for zero-resource translation. The source teacher and target teacher force the student to learn direct source-to-target translation from knowledge distilled on both the source and target sides. The pivot-teacher model further leverages the monolingual corpus to enhance the student model. Experimental results demonstrate that our model, covering 72 directions, significantly outperforms previous methods on the WMT benchmark.


SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems

arXiv.org Artificial Intelligence

Zero/few-shot transfer to unseen services is a critical challenge in task-oriented dialogue research. The Schema-Guided Dialogue (SGD) dataset introduced a paradigm for enabling models to support any service in a zero-shot setting through schemas, which describe service APIs to models in natural language. We explore the robustness of dialogue systems to linguistic variations in schemas by designing SGD-X, a benchmark extending SGD with semantically similar yet stylistically diverse variants for every schema. We observe that two top state tracking models fail to generalize well across schema variants, as measured by joint goal accuracy and a novel metric for schema sensitivity. Additionally, we present a simple model-agnostic data augmentation method to improve schema robustness.
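As an illustration of the joint goal accuracy metric mentioned in this abstract, here is a minimal Python sketch. It assumes the common definition (a turn counts as correct only if every predicted slot value matches the gold dialogue state) and uses exact matching only; the benchmark's official scoring may differ in its details. The function name and the toy data are hypothetical.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted state equals the gold state exactly.

    Each state is a dict mapping slot names to values. This sketch uses
    strict equality; it is an assumption, not the official SGD scorer.
    """
    if len(predicted_states) != len(gold_states):
        raise ValueError("predictions and gold states must have the same length")
    if not gold_states:
        return 0.0
    # A turn is correct only when the whole slot-value dict matches.
    correct = sum(1 for pred, gold in zip(predicted_states, gold_states) if pred == gold)
    return correct / len(gold_states)


# Toy example: the second turn has one wrong slot value, so it counts as wrong.
gold = [{"city": "Paris"}, {"city": "Paris", "date": "Friday"}]
pred = [{"city": "Paris"}, {"city": "Paris", "date": "Monday"}]
score = joint_goal_accuracy(pred, gold)
print(score)
```

Because a single wrong slot invalidates the whole turn, the metric is strict, which is one reason it can be sensitive to changes in schema wording.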


Bitext Mining for Low-Resource Languages via Contrastive Learning

arXiv.org Artificial Intelligence

Mining high-quality bitexts for low-resource languages is challenging. This paper shows that sentence representations from language models fine-tuned with multiple negatives ranking loss, a contrastive objective, help retrieve clean bitexts. Experiments show that parallel data mined with our approach substantially outperforms the previous state-of-the-art method on the low-resource languages Khmer and Pashto.
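The multiple negatives ranking loss named in this abstract can be sketched as an in-batch softmax cross-entropy: for each source sentence, its paired target is the positive and every other target in the batch serves as a negative. The pure-Python sketch below is an illustration under assumptions, not the paper's implementation: cosine similarity, the `scale` value of 20.0, the function names, and the toy embeddings are all choices made here, and embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(src_embs, tgt_embs, scale=20.0):
    """Mean cross-entropy where tgt_embs[i] is the positive for src_embs[i].

    All other targets in the batch act as negatives. `scale` is an assumed
    temperature-like multiplier applied to the similarity scores.
    """
    n = len(src_embs)
    total = 0.0
    for i in range(n):
        scores = [scale * cosine(src_embs[i], t) for t in tgt_embs]
        # Numerically stable log-sum-exp over all candidate targets.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]  # negative log-probability of the true pair
    return total / n


# Toy batch: aligned pairs should give a much lower loss than shuffled pairs.
src = [[1.0, 0.0], [0.0, 1.0]]
aligned = multiple_negatives_ranking_loss(src, [[1.0, 0.0], [0.0, 1.0]])
shuffled = multiple_negatives_ranking_loss(src, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

Minimizing this loss pulls true translation pairs together and pushes apart mismatched pairs, which is the property that makes the resulting embeddings usable for nearest-neighbor bitext retrieval.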


Error Correction in ASR using Sequence-to-Sequence Models

arXiv.org Artificial Intelligence

Post-editing in Automatic Speech Recognition (ASR) entails automatically correcting common and systematic errors produced by the ASR system. The outputs of an ASR system are largely prone to phonetic and spelling errors. In this paper, we propose to use a powerful pre-trained sequence-to-sequence model, BART, further adaptively trained to serve as a denoising model, to correct such errors. The adaptive training is performed on an augmented dataset obtained by synthetically inducing errors as well as by incorporating actual errors from an existing ASR system. We also propose a simple approach to rescore the outputs using word-level alignments. Experimental results on accented speech data demonstrate that our strategy effectively rectifies a significant number of ASR errors and produces improved WER results compared with a competitive baseline. We also highlight a negative result on the related grammatical error correction task in Hindi, which shows our proposed model's limitation in capturing wider context.
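For readers unfamiliar with the WER metric reported in this abstract, the sketch below computes word error rate in the standard way: the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. Whitespace tokenization, the function name, and the example sentences are assumptions made here for illustration; the paper's own evaluation setup is not described in this snippet.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(wer)
```

A post-editing model is judged successful when the WER of its corrected output is lower than the WER of the raw ASR hypothesis against the same reference.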


MATra: A Multilingual Attentive Transliteration System for Indian Scripts

arXiv.org Artificial Intelligence

Transliteration is an NLP task in which the output is a similar-sounding word written in the script of another language. Today such systems have been developed for several language pairs that involve English as either the source or the target language, and they are deployed in products such as Google Translate and chatbots. However, very little research has been done on transliteration from Indic languages to other Indic languages. This paper presents a multilingual model based on transformers (with some modifications) that achieves noticeably higher accuracy than existing models in this domain, including state-of-the-art models. The model can perform transliteration between any pair among the following five languages: English, Hindi, Bengali, Kannada, and Tamil. It is applicable in scenarios where language is a barrier to communication in any written task. The model beats the state of the art for all pairs among these five languages and achieves a top-1 accuracy score of 80.7%, about 29.5% higher than the best current results. Furthermore, the model achieves 93.5% in terms of Phonetic Accuracy (transliteration is primarily a phonetic/sound-based task).


A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

arXiv.org Artificial Intelligence

Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a new African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.


Thought Leaders in Artificial Intelligence: Spence Green, CEO of Lilt (Part 1)

#artificialintelligence

This is a terrific conversation about a SaaS-enabled BPO company, Lilt, in the domain of language translation. Sramana Mitra: Let's start by introducing our audience to yourself as well as Lilt. Spence Green: I am the CEO of Lilt. We have two parts of our business. The private sector of our business focuses on creating global customer experiences so that all products and services are available in all languages. We work with enterprises that want to make the user experience in other languages better. Usually, it is as good and personalized as it is in English. We have a public sector business that also works with language. We make it possible for governments to augment the language capabilities that they have primarily for defense and intelligence reasons. These are unified by a common technology that we have built over the past 10 years. This is all done under the mission of making the world's information available irrespective of where you were born or what language you speak.


How Meta Is Making Artificial Intelligence More Inclusive

#artificialintelligence

Artificial intelligence (AI) must be inclusive to reach its potential. AI applications that solve problems for a small segment of the population will fail to achieve widespread adoption. So, it's important that AI applications be designed and prepared with data that reflects as many segments of the global population as possible. Many moving parts need to be managed well to do that, and one of them is language. The more languages an AI application can handle, the more inclusive it is.


Searching for Structure in Unfalsifiable Claims

arXiv.org Artificial Intelligence

Social media platforms give rise to an abundance of posts and comments on every topic imaginable. Many of these posts express opinions on various aspects of society, but their unfalsifiable nature makes them ill-suited to fact-checking pipelines. In this work, we aim to distill such posts into a small set of narratives that capture the essential claims related to a given topic. Understanding and visualizing these narratives can facilitate more informed debates on social media. As a first step towards systematically identifying the underlying narratives on social media, we introduce PAPYER, a fine-grained dataset of online comments related to hygiene in public restrooms, which contains a multitude of unfalsifiable claims. We present a human-in-the-loop pipeline that uses a combination of machine and human kernels to discover the prevailing narratives and show that this pipeline outperforms recent large transformer models and state-of-the-art unsupervised topic models.


Discourse Cohesion Evaluation for Document-Level Neural Machine Translation

arXiv.org Artificial Intelligence

It is well known that translations generated by an excellent document-level neural machine translation (NMT) model are consistent and coherent. However, existing sentence-level evaluation metrics like BLEU can hardly reflect the model's performance at the document level. To tackle this issue, we propose a Discourse Cohesion Evaluation Method (DCoEM) in this paper and contribute a new test suite that considers four types of cohesion (reference, conjunction, substitution, and lexical cohesion) to measure the cohesiveness of document translations. The evaluation results on recent document-level NMT systems show that our method is practical and essential for evaluating translations at the document level.