 Machine Translation


Researchers claim that AI-translated text is less 'lexically' rich than human translations

#artificialintelligence

Human interpreters make choices unique to them, consciously or unconsciously, when translating one language into another. They might explicate, normalize, or condense and summarize, creating fingerprints known informally as "translationese." In machine learning, generating accurate translations has been the main objective thus far. But this might be coming at the expense of translation richness and diversity. In a new study, researchers at Tilburg University and the University of Maryland attempt to quantify the lexical and grammatical diversity of "machine translationese" -- i.e., the fingerprints made by AI translation algorithms.



Controlling Hallucinations at Word Level in Data-to-Text Generation

arXiv.org Artificial Intelligence

Data-to-Text Generation (DTG) is a subfield of Natural Language Generation that aims to transcribe structured data into natural language descriptions. The field has recently been boosted by the use of neural generators which, on the one hand, exhibit great syntactic skills without the need for hand-crafted pipelines; on the other hand, the quality of the generated text reflects the quality of the training data, which in realistic settings only offer imperfectly aligned structure-text pairs. Consequently, state-of-the-art neural models include misleading statements - usually called hallucinations - in their outputs. The control of this phenomenon is today a major challenge for DTG, and is the problem addressed in the paper. Previous work deals with this issue at the instance level, using an alignment score for each table-reference pair. In contrast, we propose a finer-grained approach, arguing that hallucinations should rather be treated at the word level. Specifically, we propose a Multi-Branch Decoder which is able to leverage word-level labels to learn the relevant parts of each training instance. These labels are obtained following a simple and efficient scoring procedure based on co-occurrence analysis and dependency parsing. Extensive evaluations, via automated metrics and human judgment on the standard WikiBio benchmark, show the accuracy of our alignment labels and the effectiveness of the proposed Multi-Branch Decoder. Our model is able to reduce and control hallucinations, while keeping fluency and coherence in generated texts. Further experiments on a degraded version of ToTTo show that our model could be successfully used in very noisy settings.


Bootstrapping Multilingual AMR with Contextual Word Alignments

arXiv.org Artificial Intelligence

We develop high performance multilingual Abstract Meaning Representation (AMR) systems by projecting English AMR annotations to other languages with weak supervision. We achieve this goal by bootstrapping transformer-based multilingual word embeddings, in particular those from cross-lingual RoBERTa (XLM-R large). We develop a novel technique for foreign-text-to-English AMR alignment, using the contextual word alignment between English and foreign language tokens. This word alignment is weakly supervised and relies on the contextualized XLM-R word embeddings. We achieve a highly competitive performance that surpasses the best published results for German, Italian, Spanish and Chinese.
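The abstract does not spell out how the contextual word alignment is computed. As a rough illustration only, one simple way to align tokens from embeddings is to link each foreign token to the English token whose vector is most similar. The sketch below assumes cosine similarity and greedy nearest-neighbour matching; the tiny 2-dimensional vectors are made-up stand-ins for real XLM-R embeddings, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(foreign_vecs, english_vecs):
    """For each foreign token, return the index of the most similar English token."""
    return [
        max(range(len(english_vecs)), key=lambda j: cosine(f, english_vecs[j]))
        for f in foreign_vecs
    ]


# Made-up toy vectors; real contextual embeddings would have hundreds of dimensions.
english_tokens = ["the", "cat", "sleeps"]
english_vecs = [[1.0, 0.1], [0.1, 1.0], [0.7, 0.7]]
spanish_tokens = ["el", "gato", "duerme"]
spanish_vecs = [[0.9, 0.2], [0.2, 0.9], [0.6, 0.8]]

links = align(spanish_vecs, english_vecs)
pairs = [(spanish_tokens[i], english_tokens[j]) for i, j in enumerate(links)]
```

Once tokens are linked this way, annotations attached to an English token could in principle be projected onto its foreign counterpart, which is the general idea behind annotation projection.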


Machine Translationese: Effects of Algorithmic Bias on Linguistic Complexity in Machine Translation

arXiv.org Artificial Intelligence

Recent studies in the field of Machine Translation (MT) and Natural Language Processing (NLP) have shown that existing models amplify biases observed in the training data. The amplification of biases in language technology has mainly been examined with respect to specific phenomena, such as gender bias. In this work, we go beyond the study of gender in MT and investigate how bias amplification might affect language in a broader sense. We hypothesize that the 'algorithmic bias', i.e. an exacerbation of frequently observed patterns in combination with a loss of less frequent ones, not only exacerbates societal biases present in current datasets but could also lead to an artificially impoverished language: 'machine translationese'. We assess the linguistic richness (on a lexical and morphological level) of translations created by different data-driven MT paradigms - phrase-based statistical (PB-SMT) and neural MT (NMT). Our experiments show that there is a loss of lexical and morphological richness in the translations produced by all investigated MT paradigms for two language pairs (EN<=>FR and EN<=>ES).
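To make "lexical richness" concrete, here is a minimal sketch of one widely used diversity measure, the type-token ratio (distinct word forms divided by total words). It is an illustration under our own assumptions, not necessarily the exact metric or tokenization used in the study, and the two example sentences are invented.

```python
import re


def type_token_ratio(text: str) -> float:
    """Distinct word forms divided by total words; 0.0 for text with no words."""
    tokens = re.findall(r"\w+", text.lower())  # naive tokenizer, assumed for the sketch
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented examples: the second sentence reuses the same frequent words.
varied = "the weary traveller trudged home while the exhausted walker plodded back"
repetitive = "the tired man walked home while the tired man walked back"

ttr_varied = type_token_ratio(varied)          # 10 types over 11 tokens
ttr_repetitive = type_token_ratio(repetitive)  # 7 types over 11 tokens
```

Type-token ratio is sensitive to text length, so it is only meaningful when the compared texts are of similar size.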


Taxonomic survey of Hindi Language NLP systems

arXiv.org Artificial Intelligence

The field of natural language processing (NLP) can be formally defined as "A theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications" [69]. The naturally occurring text can be in written or spoken form. A wide array of domains contribute to NLP development, such as linguistics, computer science and psychology. Linguistics helps in understanding the formal structure of language, while computer science helps in finding efficient internal representations and data structures. The study of psychology can be useful for understanding the methods humans use to deal with language. NLP can be considered to have two distinct focuses, namely (1) Natural Language Generation (NLG) and (2) Natural Language Understanding (NLU). NLG deals with planning how to use the representation of language to decide what should be generated at each point in an interaction, while NLU needs to analyze language and decide on the best way to represent it meaningfully. In this survey paper, we concentrate on NLU for written text. Hence, NLP may henceforth be read as NLU and vice versa. Motivation for designing Indian NLP systems: Hindi and English are the official languages of the central Government of India (GOI). The Indian community faces a "digital divide" due to the dominance of English as the mode of communication in higher education, the judiciary, the corporate sector and public administration at the central level, whereas state governments work in their respective regional languages [67]. The expansion of the Internet has interconnected the socioeconomic environment of the world and redefined the concept of global culture. As per a report in 2017 by the companies KPMG and Google


Disembodied Machine Learning: On the Illusion of Objectivity in NLP

arXiv.org Artificial Intelligence

Machine Learning seeks to identify and encode bodies of knowledge within provided datasets. However, data encodes subjective content, which determines the possible outcomes of the models trained on it. Because such subjectivity enables marginalisation of parts of society, it is termed (social) 'bias' and sought to be removed. In this paper, we contextualise this discourse of bias in the ML community against the subjective choices in the development process. Through a consideration of how choices in data and model development construct subjectivity, or biases that are represented in a model, we argue that addressing and mitigating biases is near-impossible. This is because both data and ML models are objects for which meaning is made in each step of the development pipeline, from data selection through annotation to model training and analysis. Accordingly, we find the prevalent discourse of bias limiting in its ability to address social marginalisation. We recommend being conscientious of this and accepting that de-biasing methods only correct for a fraction of biases.


Exploring multi-task multi-lingual learning of transformer models for hate speech and offensive speech identification in social media

arXiv.org Artificial Intelligence

Thus, social media platforms are often held responsible for framing the views and opinions of a large number of people (Duggan et al., 2017). However, this freedom to voice our opinion has been challenged by the increase in the use of hate speech (Mondal et al., 2017). The anonymity of the internet grants people the power to completely change the context of a discussion and suppress a person's personal opinion (Sticca and Perren, 2013). These hateful posts and comments affect society not only at a micro scale but also at a global level, by influencing people's views regarding important global events like elections and protests (Duggan et al., 2017). Given the volume of online communication happening on various social media platforms and the need for more fruitful communication, there is a growing need to automate the detection of hate speech. For the scope of this paper, we adopt the definition of hate speech and offensive speech given in Mandl et al. (2019): "insulting, hurtful, derogatory, or obscene content directed from one person to another person". To automate hate speech detection, the Natural Language Processing (NLP) community has made significant progress, which has been accelerated by the organization of numerous shared tasks aimed at identifying hate speech (Mandl et al., 2019; Kumar et al., 2020, 2018).


Unanswerable Questions about Images and Texts

arXiv.org Artificial Intelligence

It will be useful to set up a general, abstract framework in which to discuss these issues. Generally speaking, for AI systems, and for that matter for computer programs of any kind designed for a particular task, the actual ultimate objective can be formulated as follows. There is a class X of inputs that are "reasonable" problems for Q. There is a class Y of possible outputs. The task defines a relation Q(x, y) meaning "y is a good output [or an acceptable output, or the best possible output] on the task for input x." We assume that for every x ∈ X there is at least one y ∈ Y such that Q(x, y).
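The assumption that every reasonable input has at least one acceptable output can be checked directly when X and Y are small finite sets. The following is a toy sketch of that check; the doubling task, the sets and the function names are invented purely for illustration and do not come from the paper.

```python
def inputs_without_output(inputs, outputs, q):
    """Return the inputs x for which no y in outputs satisfies q(x, y)."""
    return [x for x in inputs if not any(q(x, y) for y in outputs)]


def is_total(inputs, outputs, q):
    """True if every input has at least one acceptable output."""
    return not inputs_without_output(inputs, outputs, q)


# Invented toy task: y is an acceptable output for x when y equals 2 * x.
def q_double(x, y):
    return y == 2 * x


X = [1, 2, 3]
complete_Y = [2, 4, 6]
partial_Y = [2, 4]

total_ok = is_total(X, complete_Y, q_double)
missing = inputs_without_output(X, partial_Y, q_double)
```

With the partial output set, the input 3 has no acceptable answer, so the assumption fails for that choice of X and Y.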


Fast Sequence Generation with Multi-Agent Reinforcement Learning

arXiv.org Artificial Intelligence

Autoregressive sequence generation models have achieved state-of-the-art performance in areas like machine translation and image captioning. These models are autoregressive in that they generate each word by conditioning on previously generated words, which leads to heavy latency during inference. Recently, non-autoregressive decoding has been proposed in machine translation to speed up inference by generating all words in parallel. Typically, these models use the word-level cross-entropy loss to optimize each word independently. However, such a learning process fails to consider sentence-level consistency, resulting in inferior generation quality for these non-autoregressive models. In this paper, we propose a simple and efficient model for Non-Autoregressive sequence Generation (NAG) with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL). CMAL formulates NAG as a multi-agent reinforcement learning system where element positions in the target sequence are viewed as agents that learn to cooperatively maximize a sentence-level reward. On the MSCOCO image captioning benchmark, our NAG method achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup. On the WMT14 EN-DE machine translation dataset, our method outperforms the cross-entropy-trained baseline by 6.0 BLEU points while achieving the greatest decoding speedup, 17.46x.
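The structural difference between the two decoding styles can be sketched in a few lines. This is a toy contrast only, not the paper's model or its CMAL training procedure: the lookup-table "models" and all function names below are hypothetical stand-ins.

```python
TARGET = ["a", "dog", "runs"]  # invented caption the toy models reproduce


def toy_next_word(prefix):
    """Stand-in autoregressive model: the next word depends on the words so far."""
    return TARGET[len(prefix)]


def toy_word_at(position):
    """Stand-in non-autoregressive model: each position is predicted independently."""
    return TARGET[position]


def decode_autoregressive(next_word, length):
    """Generate one word per step; step t cannot start before step t - 1 finishes."""
    words = []
    for _ in range(length):
        words.append(next_word(tuple(words)))
    return words


def decode_non_autoregressive(word_at, length):
    """Predict every position without looking at other outputs.

    The calls are independent, so a real system could run them in parallel.
    """
    return [word_at(i) for i in range(length)]


ar_output = decode_autoregressive(toy_next_word, len(TARGET))
nar_output = decode_non_autoregressive(toy_word_at, len(TARGET))
```

Because the non-autoregressive calls never see each other's outputs, nothing in word-level training forces the positions to agree with one another, which is the consistency gap a sentence-level reward is meant to address.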