Machine Translation
How to Measure Gender Bias in Machine Translation: Optimal Translators, Multiple Reference Points
In this paper--as a case study--we present a systematic study of gender bias in machine translation with Google Translate. We translated sentences containing names of occupations from Hungarian, a language with gender-neutral pronouns, into English. Our aim was to present a fair measure for bias by comparing the translations to an optimal non-biased translator. When assessing bias, we used the following reference points: (1) the distribution of men and women among occupations in both the source and the target language countries, as well as (2) the results of a Hungarian survey that examined if certain jobs are generally perceived as feminine or masculine. We also studied how expanding sentences with adjectives referring to occupations effect the gender of the translated pronouns. As a result, we found bias against both genders, but biased results against women are much more frequent. Translations are closer to our perception of occupations than to objective occupational statistics. Finally, occupations have a greater effect on translation than adjectives.
On learning language-invariant representations for universal machine translation
Despite the recent improvements in neural machine translation (NMT), training a large NMT model with hundreds of millions of parameters usually requires a collection of parallel corpora at a large scale, on the order of millions or even billions of aligned sentences for supervised training (Arivazhagan et al.). While it might be possible to automatically crawl the web to collect parallel sentences for high-resource language pairs, such as German-English and French-English, it is often infeasible or expensive to manually translate large amounts of sentences for low-resource language pairs, such as Nepali-English, Sinhala-English, etc. To this end, the goal of the so-called multilingual universal machine translation, a.k.a., universal machine translation (UMT), is to learn to translate between any pair of languages using a single system, given pairs of translated documents for some of these languages. The hope is that by learning a shared "semantic space" between multiple source and target languages, the model can leverage language-invariant structure from high-resource translation pairs to transfer to the translation between low-resource language pairs, or even enable zero-shot translation. Indeed, training such a single massively multilingual model has gained impressive empirical results, especially in the case of low-resource language pairs (see Figure 1).
gordicaleksa/pytorch-original-transformer
This repo contains PyTorch implementation of the original transformer paper ( Vaswani et al.). It's aimed at making it easy to start playing and learning about transformers. Important note: I'll be adding a jupyter notebook soon as well! Transformers were originally proposed by Vaswani et al. in a seminal paper called Attention Is All You Need. You probably heard of transformers one way or another. GPT-3 and BERT to name a few well known ones .
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations
Han, Lifeng, Jones, Gareth, Smeaton, Alan
In this work, we present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs). MWEs include verbal MWEs (vMWEs) defined in the PARSEME shared task that have a verb as the head of the studied terms. The annotated vMWEs are also bilingually and multilingually aligned manually. The languages covered include English, Chinese, Polish, and German. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post editing and annotation of target MWEs. Strict quality control was applied for error limitation, i.e., each MT output sentence received first manual post editing and annotation plus second manual quality rechecking. One of our findings during corpora preparation is that accurate translation of MWEs presents challenges to MT systems. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT models for comparisons namely: Microsoft Bing Translator, GoogleMT, Baidu Fanyi and DeepL MT. Because of the noise removal, translation post editing and MWE annotation by human professionals, we believe our AlphaMWE dataset will be an asset for cross-lingual and multilingual research, such as MT and information extraction. Our multilingual corpora are available as open access at github.com/poethan/AlphaMWE.
Template Controllable keywords-to-text Generation
Mishra, Abhijit, Chowdhury, Md Faisal Mahbub, Manohar, Sagar, Gutfreund, Dan, Sankaranarayanan, Karthik
This paper proposes a novel neural model for the understudied task of generating text from keywords. The model takes as input a set of un-ordered keywords, and part-of-speech (POS) based template instructions. This makes it ideal for surface realization in any NLG setup. The framework is based on the encode-attend-decode paradigm, where keywords and templates are encoded first, and the decoder judiciously attends over the contexts derived from the encoded keywords and templates to generate the sentences. Training exploits weak supervision, as the model trains on a large amount of labeled data with keywords and POS based templates prepared through completely automatic means. Qualitative and quantitative performance analyses on publicly available test-data in various domains reveal our system's superiority over baselines, built using state-of-the-art neural machine translation and controllable transfer techniques. Our approach is indifferent to the order of input keywords.
Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
Nekoto, Wilhelmina, Marivate, Vukosi, Matsila, Tshinondiwa, Fasubaa, Timi, Kolawole, Tajudeen, Fagbohungbe, Taiwo, Akinola, Solomon Oluwole, Muhammad, Shamsuddeen Hassan, Kabongo, Salomon, Osei, Salomey, Freshia, Sackey, Niyongabo, Rubungo Andre, Macharm, Ricky, Ogayo, Perez, Ahia, Orevaoghene, Meressa, Musie, Adeyemi, Mofe, Mokgesi-Selinga, Masabata, Okegbemi, Lawrence, Martinus, Laura Jane, Tajudeen, Kolawole, Degila, Kevin, Ogueji, Kelechi, Siminyu, Kathleen, Kreutzer, Julia, Webster, Jason, Ali, Jamiil Toure, Abbott, Jade, Orife, Iroro, Ezeani, Ignatius, Dangana, Idris Abdulkabir, Kamper, Herman, Elsahar, Hady, Duru, Goodness, Kioko, Ghollah, Murhabazi, Espoir, van Biljon, Elan, Whitenack, Daniel, Onyefuluchi, Christopher, Emezue, Chris, Dossou, Bonaventure, Sibanda, Blessing, Bassey, Blessing Itoro, Olabiyi, Ayodele, Ramkilowan, Arshath, Öktem, Alp, Akinfaderin, Adewale, Bashir, Abdallah
Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), that plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released under https://github.com/masakhane-io/masakhane-mt.
Evolution and Future of Natural Language Processing (NLP)
Natural Language Processing (NLP) a subset technique of Artificial Intelligence which is used to narrow the communication gap between the Computer and Human. It is originated from the idea of Machine Translation (MT) which came to existence during the second world war. The primary idea was to convert one human language to another human language, for example, turning the Russian language to English language using the brain of the Computers but after that, the thought of conversion of human language to computer language and vice-versa emerged, so that communication with the machine became easy. In simple words, a language can be understood as a group of rules or symbol. These symbols are integrated and then used for transmitting as well as broadcasting the information.
Translating lost languages using machine learning
Recent research suggests that most languages that have ever existed are no longer spoken. Dozens of these dead languages are also considered to be lost, or "undeciphered" -- that is, we don't know enough about their grammar, vocabulary, or syntax to be able to actually understand their texts. Lost languages are more than a mere academic curiosity; without them, we miss an entire body of knowledge about the people who spoke them. Unfortunately, most of them have such minimal records that scientists can't decipher them by using machine-translation algorithms like Google Translate. Some don't have a well-researched "relative" language to be compared to, and often lack traditional dividers like white space and punctuation.
Detecting Hallucinated Content in Conditional Neural Sequence Generation
Zhou, Chunting, Gu, Jiatao, Diab, Mona, Guzman, Paco, Zettlemoyer, Luke, Ghazvininejad, Marjan
Neural sequence models can generate highly fluent sentences but recent studies have also shown that they are also prone to hallucinate additional content not supported by the input, which can cause a lack of trust in the model. To better assess the faithfulness of the machine outputs, we propose a new task to predict whether each token in the output sequence is hallucinated conditioned on the source input, and collect new manually annotated evaluation sets for this task. We also introduce a novel method for learning to model hallucination detection, based on pretrained language models fine tuned on synthetic data that includes automatically inserted hallucinations. Experiments on machine translation and abstract text summarization demonstrate the effectiveness of our proposed approach -- we obtain an average F1 of around 0.6 across all the benchmark datasets and achieve significant improvements in sentence-level hallucination scoring compared to baseline methods. We also release our annotated data and code for future research at https://github.com/violet-zct/fairseq-detect-hallucination.
TransQuest: Translation Quality Estimation with Cross-lingual Transformers
Ranasinghe, Tharindu, Orasan, Constantin, Mitkov, Ruslan
Recent years have seen big advances in the field of sentence-level quality estimation (QE), largely as a result of using neural-based architectures. However, the majority of these methods work only on the language pair they are trained on and need retraining for new language pairs. This process can prove difficult from a technical point of view and is usually computationally expensive. In this paper we propose a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. Our evaluation shows that the proposed methods achieve state-of-the-art results outperforming current open-source quality estimation frameworks when trained on datasets from WMT. In addition, the framework proves very useful in transfer learning settings, especially when dealing with low-resourced languages, allowing us to obtain very competitive results.