Machine Translation
Very Low Resource Sentence Alignment: Luhya and Swahili
Chimoto, Everlyn Asiko, Bassett, Bruce A.
Language-agnostic sentence embeddings generated by pre-trained models such as LASER and LaBSE are attractive options for mining large datasets to produce parallel corpora for low-resource machine translation. We test LASER and LaBSE in extracting bitext for two related low-resource African languages: Luhya and Swahili. For this work, we created a new parallel set of nearly 8000 Luhya-English sentences which allows a new zero-shot test of LASER and LaBSE. We find that LaBSE significantly outperforms LASER on both languages. Both LASER and LaBSE however perform poorly at zero-shot alignment on Luhya, achieving just 1.5% and 22.0% successful alignments respectively (P@1 score). We fine-tune the embeddings on a small set of parallel Luhya sentences and show significant gains, improving the LaBSE alignment accuracy to 53.3%. Further, restricting the dataset to sentence embedding pairs with cosine similarity above 0.7 yielded alignments with over 85% accuracy.
How tech is helping us talk to animals
The world around us is vibrating with sounds we cannot hear. Bats chitter and babble in ultrasound; elephants rumble infrasonic secrets to each other; coral reefs are aquatic clubs, hopping with the cracks and hisses and clicks of marine life. For centuries, we didn't even know those sounds existed. But as technology has advanced, so has our capacity to listen. Today, tools like drones, digital recorders, and artificial intelligence are helping us listen to the sounds of nature in unprecedented ways, transforming the world of scientific research and raising a tantalizing prospect: Someday soon, computers might allow us to talk to animals. In some ways, that has already begun.
DiffusER: Discrete Diffusion via Edit-based Reconstruction
Reid, Machel, Hellendoorn, Vincent J., Neubig, Graham
In text generation, models that generate text from scratch one token at a time are currently the dominant paradigm. Despite being performant, these models lack the ability to revise existing text, which limits their usability in many practical scenarios. We look to address this, with DiffusER (Diffusion via Edit-based Reconstruction), a new edit-based generative model for text based on denoising diffusion models -- a class of models that use a Markov chain of denoising steps to incrementally generate data. DiffusER is not only a strong generative model in general, rivalling autoregressive models on several tasks spanning machine translation, summarization, and style transfer; it can also perform other varieties of generation that standard autoregressive models are not well-suited for. For instance, we demonstrate that DiffusER makes it possible for a user to condition generation on a prototype, or an incomplete sequence, and continue revising based on previous edit steps.
Multilingual Multimodality: A Taxonomical Survey of Datasets, Techniques, Challenges and Opportunities
Chandu, Khyathi Raghavi, Geramifard, Alborz
Contextualizing language technologies beyond a single language kindled embracing multiple modalities and languages. Individually, each of these directions undoubtedly proliferated into several NLP tasks. Despite this momentum, most of the multimodal research is primarily centered around English and multilingual research is primarily centered around contexts from text modality. Challenging this conventional setup, researchers studied the unification of multilingual and multimodal (MultiX) streams. The main goal of this work is to catalogue and characterize these works by charting out the categories of tasks, datasets and methods to address MultiX scenarios. To this end, we review the languages studied, gold or silver data with parallel annotations, and understand how these modalities and languages interact in modeling. We present an account of the modeling approaches along with their strengths and weaknesses to better understand what scenarios they can be used reliably. Following this, we present the high-level trends in the overall paradigm of the field. Finally, we conclude by presenting a road map of challenges and promising research directions.
Counterfactual Data Augmentation via Perspective Transition for Open-Domain Dialogues
Ou, Jiao, Zhang, Jinchao, Feng, Yang, Zhou, Jie
The construction of open-domain dialogue systems requires high-quality dialogue datasets. The dialogue data admits a wide variety of responses for a given dialogue history, especially responses with different semantics. However, collecting high-quality such a dataset in most scenarios is labor-intensive and time-consuming. In this paper, we propose a data augmentation method to automatically augment high-quality responses with different semantics by counterfactual inference. Specifically, given an observed dialogue, our counterfactual generation model first infers semantically different responses by replacing the observed reply perspective with substituted ones. Furthermore, our data selection method filters out detrimental augmented responses. Experimental results show that our data augmentation method can augment high-quality responses with different semantics for a given dialogue history, and can outperform competitive baselines on multiple downstream tasks.
Improving Bilingual Lexicon Induction with Cross-Encoder Reranking
Li, Yaoyiran, Liu, Fangyu, Vuliฤ, Ivan, Korhonen, Anna
Bilingual lexicon induction (BLI) with limited bilingual supervision is a crucial yet challenging task in multilingual NLP. Current state-of-the-art BLI methods rely on the induction of cross-lingual word embeddings (CLWEs) to capture cross-lingual word similarities; such CLWEs are obtained 1) via traditional static models (e.g., VecMap), or 2) by extracting type-level CLWEs from multilingual pretrained language models (mPLMs), or 3) through combining the former two options. In this work, we propose a novel semi-supervised post-hoc reranking method termed BLICEr (BLI with Cross-Encoder Reranking), applicable to any precalculated CLWE space, which improves their BLI capability. The key idea is to 'extract' cross-lingual lexical knowledge from mPLMs, and then combine it with the original CLWEs. This crucial step is done via 1) creating a word similarity dataset, comprising positive word pairs (i.e., true translations) and hard negative pairs induced from the original CLWE space, and then 2) fine-tuning an mPLM (e.g., mBERT or XLM-R) in a cross-encoder manner to predict the similarity scores. At inference, we 3) combine the similarity score from the original CLWE space with the score from the BLI-tuned cross-encoder. BLICEr establishes new state-of-the-art results on two standard BLI benchmarks spanning a wide spectrum of diverse languages: it substantially outperforms a series of strong baselines across the board. We also validate the robustness of BLICEr with different CLWEs.
Leveraging Locality in Abstractive Text Summarization
Liu, Yixin, Ni, Ansong, Nan, Linyong, Deb, Budhaditya, Zhu, Chenguang, Awadallah, Ahmed H., Radev, Dragomir
Neural attention models have achieved significant improvements on many natural language processing tasks. However, the quadratic memory complexity of the self-attention module with respect to the input length hinders their applications in long text summarization. Instead of designing more efficient attention modules, we approach this problem by investigating if models with a restricted context can have competitive performance compared with the memory-efficient attention models that maintain a global context by treating the input as a single sequence. Our model is applied to individual pages which contain parts of inputs grouped by the principle of locality during both encoding and decoding. We empirically investigated three kinds of locality in text summarization at different levels of granularity, ranging from sentences to documents. Our experimental results show that our model has a better performance compared with strong baselines with efficient attention modules, and our analysis provides further insights into our locality-aware modeling strategy.
Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation
Wei, Kun, Zhou, Long, Zhang, Ziqiang, Chen, Liping, Liu, Shujie, He, Lei, Li, Jinyu, Wei, Furu
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare. To address this issue, we propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks. By effectively leveraging the paired text data, Speech2S is capable of modeling the cross-lingual speech conversion from source to target language. We verify the performance of the proposed Speech2S on Europarl-ST and VoxPopuli datasets. Experimental results demonstrate that Speech2S gets an improvement of about 5 BLEU scores compared to encoder-only pre-training models, and achieves a competitive or even better performance than existing state-of-the-art models1.
Artificial Intelligence and Localization
Artificial intelligence is a game changer for the content industry in general and the language industry in particular, in terms of both technology and services. Unsurprisingly, it has had a vast and continued impact on localization people, processes, and technology. In the early days of artificial intelligence (AI), machine learning (ML), and machine translation (MT) offered a basic and automatic conversion of text, such as glorified versions of online dictionaries and glossaries. ML and MT are now propelled by neural networks that bring them closer to functioning and behaving like human brains. Although AI is getting closer to human behavior, it cannot mimic it fully, and therefore, it has not yet replaced human beings.
HashFormers: Towards Vocabulary-independent Pre-trained Transformers
Xue, Huiyin, Aletras, Nikolaos
Transformer-based pre-trained language models are vocabulary-dependent, mapping by default each token to its corresponding embedding. This one-to-one mapping results into embedding matrices that occupy a lot of memory (i.e. millions of parameters) and grow linearly with the size of the vocabulary. Previous work on on-device transformers dynamically generate token embeddings on-the-fly without embedding matrices using locality-sensitive hashing over morphological information. These embeddings are subsequently fed into transformer layers for text classification. However, these methods are not pre-trained. Inspired by this line of work, we propose HashFormers, a new family of vocabulary-independent pre-trained transformers that support an unlimited vocabulary (i.e. all possible tokens in a corpus) given a substantially smaller fixed-sized embedding matrix. We achieve this by first introducing computationally cheap hashing functions that bucket together individual tokens to embeddings. We also propose three variants that do not require an embedding matrix at all, further reducing the memory requirements. We empirically demonstrate that HashFormers are more memory efficient compared to standard pre-trained transformers while achieving comparable predictive performance when fine-tuned on multiple text classification tasks. For example, our most efficient HashFormer variant has a negligible performance degradation (0.4\% on GLUE) using only 99.1K parameters for representing the embeddings compared to 12.3-38M parameters of state-of-the-art models.