 Machine Translation


Paraphrasing Techniques for Maritime QA system

arXiv.org Artificial Intelligence

There has been increasing interest in incorporating Artificial Intelligence (AI) into Defence and military systems to complement and augment human intelligence and capabilities. However, much work still needs to be done toward achieving an effective human-machine partnership. This work aims to enhance human-machine communication by developing a capability for automatically translating human natural language into a machine-understandable language (e.g., SQL queries). Techniques toward achieving this goal typically involve building a semantic parser trained on a very large amount of high-quality, manually annotated data. However, in many real-world Defence scenarios, it is not feasible to obtain such a large amount of training data. To the best of our knowledge, few works have explored the possibility of training a semantic parser with limited manually paraphrased data, in other words, zero-shot. In this paper, we investigate how to exploit paraphrasing methods for the automated generation of large-scale training datasets (in the form of paraphrased utterances and their corresponding logical forms in SQL format) and present our experimental results using real-world data in the maritime domain.
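To make the idea of pairing paraphrased utterances with SQL logical forms concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the paraphrase templates, the `vessels` table, and its `type` and `nearest_port` columns are all hypothetical, and simple template filling stands in for whatever paraphrasing method the paper actually uses.

```python
from itertools import product

# Hypothetical paraphrase templates; every template expresses the same intent.
TEMPLATES = [
    "show me all {vessel_type} vessels near {port}",
    "which {vessel_type} ships are close to {port}?",
    "list {vessel_type} vessels in the vicinity of {port}",
]

# Hypothetical logical form; the table and column names are assumptions.
SQL_TEMPLATE = (
    "SELECT name FROM vessels "
    "WHERE type = '{vessel_type}' AND nearest_port = '{port}';"
)


def generate_pairs(vessel_types, ports):
    """Return (utterance, sql) training pairs for every slot combination."""
    pairs = []
    for vessel_type, port in product(vessel_types, ports):
        sql = SQL_TEMPLATE.format(vessel_type=vessel_type, port=port)
        for template in TEMPLATES:
            utterance = template.format(vessel_type=vessel_type, port=port)
            pairs.append((utterance, sql))
    return pairs


pairs = generate_pairs(["cargo", "fishing"], ["Sydney", "Darwin"])
for utterance, sql in pairs[:3]:
    print(utterance, "->", sql)
```

Each slot combination yields one SQL string shared by all of its paraphrases, which is what lets a small set of templates expand into a much larger training set. The SQL here is built by string formatting only because it is training text; it is not meant to be executed against a database with untrusted input.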


Lexical Complexity Prediction: An Overview

arXiv.org Artificial Intelligence

Understanding the meaning of words in context is fundamental for reading comprehension. The perceived difficulty, hereafter referred to as complexity, of a target word within a given text varies widely among readers. With an increased demand for distance learning and educational technologies [107], automatically predicting which words are likely to cause comprehension problems is becoming a popular area of research [115, 147, 185]. Systems have been created to identify complex words that are difficult to acquire, reproduce, or understand for children [79], second-language learners [89], people with a reading disability, such as dyslexia [131] or aphasia [35, 53], or, more generally, individuals with low literacy [59, 175]. In Computational Linguistics and Natural Language Processing (NLP), the task of automatically recognizing complex words is most often achieved by training machine learning (ML) models. These ML models assign a complexity value to each target word within an input extract, sentence, or text, which allows complex words to be identified. This information can then be used to improve downstream lexical and text simplification systems that provide simpler alternatives to aid reading comprehension. Take the extract shown in Table 1 for example.
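As a rough illustration of what "assigning a complexity value to each target word" can look like, the sketch below scores words with a hand-written heuristic. This is an assumption-laden stand-in, not a method from the overview: the systems described there are trained ML models, whereas the frequency table, the equal weighting, and the `complexity` function here are invented purely for illustration.

```python
import math
import re

# Toy frequency counts (hypothetical); a real system would use a large corpus.
WORD_FREQ = {"the": 1000, "cat": 120, "sat": 80, "on": 900, "mat": 40}


def complexity(word, max_len=12):
    """Heuristic score in (0, 1]: rarer and longer words score higher."""
    freq = WORD_FREQ.get(word.lower(), 0)
    rarity = 1.0 / (1.0 + math.log1p(freq))  # equals 1.0 for unseen words
    length = min(len(word), max_len) / max_len
    return 0.5 * rarity + 0.5 * length  # equal weights are an assumption


def score_sentence(sentence):
    """Return (token, complexity) pairs for each alphabetic token."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [(token, round(complexity(token), 3)) for token in tokens]


for token, score in score_sentence("The ubiquitous cat sat on the mat"):
    print(f"{token:12s} {score}")
```

A downstream simplification system could then flag any token whose score exceeds a chosen threshold and look for a simpler alternative.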


Out-of-Distribution Detection and Selective Generation for Conditional Language Models

arXiv.org Artificial Intelligence

Machine learning algorithms typically assume independent and identically distributed samples in training and at test time. Much work has shown that high-performing ML classifiers can degrade significantly and provide overly confident, wrong classification predictions, particularly for out-of-distribution (OOD) inputs. Conditional language models (CLMs) are predominantly trained to classify the next token in an output sequence, and may suffer even worse degradation on OOD inputs because the prediction is done auto-regressively over many steps. Furthermore, the space of potential low-quality outputs is larger because arbitrary text can be generated, so it is important to know when to trust the generated output. We present a highly accurate and lightweight OOD detection method for CLMs, and demonstrate its effectiveness on abstractive summarization and translation. We also show how our method can be used under the common and realistic setting of distribution shift for selective generation (analogous to selective prediction for classification) of high-quality outputs, while automatically abstaining from low-quality ones, enabling safer deployment of generative language models.
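The control flow of selective generation can be sketched in a few lines of Python: score the input, abstain if it looks out of distribution, otherwise generate. The abstract does not describe the paper's detector, so the scorer below (fraction of tokens outside a small known vocabulary), the threshold, and the uppercase "generator" are all hypothetical placeholders.

```python
from typing import Callable, Optional

# Hypothetical in-distribution vocabulary for the toy OOD score.
KNOWN_VOCAB = {"the", "ship", "left", "port", "at", "dawn"}


def toy_ood_score(text: str) -> float:
    """Fraction of tokens not in KNOWN_VOCAB (1.0 for empty input)."""
    tokens = text.lower().split()
    if not tokens:
        return 1.0
    unknown = sum(1 for token in tokens if token not in KNOWN_VOCAB)
    return unknown / len(tokens)


def toy_generate(text: str) -> str:
    """Stand-in for a conditional language model."""
    return text.upper()


def selective_generate(
    text: str,
    generate: Callable[[str], str],
    ood_score: Callable[[str], float],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return generated output, or None to abstain on likely-OOD input."""
    if ood_score(text) > threshold:
        return None  # abstain instead of emitting a low-quality output
    return generate(text)


print(selective_generate("the ship left port", toy_generate, toy_ood_score))
print(selective_generate("quantum flux capacitor", toy_generate, toy_ood_score))
```

Keeping the scorer and generator as injected callables means a stronger OOD detector can be swapped in without touching the abstention logic.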


GATE: A Challenge Set for Gender-Ambiguous Translation Examples

arXiv.org Artificial Intelligence

Although recent years have brought significant progress in improving translation of unambiguously gendered sentences, translation of ambiguously gendered input remains relatively unexplored. When source gender is ambiguous, machine translation models typically default to stereotypical gender roles, perpetuating harmful bias. Recent work has led to the development of "gender rewriters" that generate alternative gender translations on such ambiguous inputs, but such systems are plagued by poor linguistic coverage. To encourage better performance on this task, we present and release GATE, a linguistically diverse corpus of gender-ambiguous source sentences along with multiple alternative target language translations. We also provide tools for evaluation and system analysis when using GATE, and use them to evaluate our translation rewriter system.


Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

arXiv.org Artificial Intelligence

We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problem, which can be controlled by a language ID. Audio samples are available at https://aka.ms/vallex.


CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization

arXiv.org Artificial Intelligence

Cross-lingual summarization (CLS) has attracted increasing interest in recent years due to the availability of large-scale web-mined datasets and the advancements of multilingual language models. However, given the rarity of naturally occurring CLS resources, the majority of datasets are forced to rely on translation, which can contain overly literal artifacts. This restricts our ability to observe naturally occurring CLS pairs that capture organic diction, including instances of code-switching. This alternation between languages in mid-message is a common phenomenon in multilingual settings yet has been largely overlooked in cross-lingual contexts due to data scarcity. To address this gap, we introduce CroCoSum, a dataset of cross-lingual code-switched summarization of technology news. It consists of over 24,000 English source articles and 18,000 human-curated Chinese news summaries, with more than 92% of the summaries containing code-switched phrases. For reference, we evaluate the performance of existing approaches including pipeline, end-to-end, and zero-shot methods. We show that leveraging existing resources as a pretraining step does not improve performance on CroCoSum, indicating the limited generalizability of existing resources. Finally, we discuss the challenges of evaluating cross-lingual summarizers on code-switched generation through qualitative error analyses. Our collection and code can be accessed at https://github.com/RosenZhang/CroCoSum.


On the Importance of Sign Labeling: The Hamburg Sign Language Notation System Case Study

arXiv.org Artificial Intelligence

Labeling is the cornerstone of supervised machine learning, which has been exploited in a plethora of applications, with sign language recognition being one of them. However, such algorithms must be fed a huge amount of consistently labeled data during the training process to produce a model that generalizes well. In addition, there is a great need for an automated solution that works with any nationally diversified sign language. Although there are language-agnostic transcription systems, such as the Hamburg Sign Language Notation System (HamNoSys), that describe the signer's initial position and body movement instead of the glosses' meanings, there are still issues with providing accurate and reliable labels for every real-world use case. In this context, the industry relies heavily on manual attribution and labeling of the available video data. In this work, we tackle this issue and thoroughly analyze the HamNoSys labels provided by various maintainers of open sign language corpora in five sign languages, in order to examine the challenges encountered in labeling video data. We also investigate the consistency and objectivity of HamNoSys-based labels for the purpose of training machine learning models. Our findings provide valuable insights into the limitations of the current labeling methods and pave the way for future research on developing more accurate and efficient solutions for sign language recognition.


The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

arXiv.org Artificial Intelligence

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) (BigScience Workshop, 2022) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.


The Tech That Helped Me Through My Dad's Death--and in Some Ways, Keeps Him Alive

Slate

When my dad died, he became part of the cloud. Not the one up high in the sky, but rather an online cumulus that now stores and archives a record of his last 18 months on earth. On my laptop, and even more prominently on my phone, I carry with me digital traces of my dad that I can't yet bring myself to access. Four years after his death, I still sit with a kind of grief that remains more raw than residual, and his memory lingers in digital purgatory--undeleted yet untouched; saved but not sought. He "lives" in this liminal digital space; like a grave I can't yet bring myself to visit, but simply know is there.


AI translation firm unveils 'world-first' timeline to singularity

#artificialintelligence

An Italian company has unveiled a novel method of measuring AI progress: analyzing improvements in machine translation. Translated, a provider of translation services, used the approach to predict when we will achieve singularity, a vague concept often defined as the point where machines become smarter than humans. The Rome-based business sets this milestone at the moment when AI provides "a perfect translation." According to the new research, this arrives when machine translation (MT) is better than top human translations. Translated's analysis suggests this will happen before the end of the 2020s.