Analysing Neural Language Models: Contextual Decomposition Reveals Default Reasoning in Number and Gender Assignment Artificial Intelligence

Analysing Neural Language Models: Contextual Decomposition Reveals Default Reasoning in Number and Gender Assignment Jaap Jumelet ILLC, University of Amsterdam Abstract Extensive research has recently shown that recurrent neural language models are able to process a wide range of grammatical phenomena. How these models are able to perform these remarkable feats so well, however, is still an open question. To gain more insight into what information LSTMs base their decisions on, we propose a generalisation of Contextual Decomposition (GCD). In particular, this setup enables us to accurately distil which part of a prediction stems from semantic heuristics, which part truly emanates from syntactic cues and which part arise from the model biases themselves instead. We investigate this technique on tasks pertaining to syntactic agreement and coreference resolution and discover that the model strongly relies on a default reasoning effect to perform these tasks. 1 Introduction Modern language models that use deep learning architectures such as LSTMs, bi-LSTMs and Transformers, have shown enormous gains in performance in the last few years and are finding applications in novel domains, ranging from speech recognition and writing assistance to autonomous generation of fake news. Understanding how they reach their predictions has become a key question for NLP, not only for purely scientific, but also for practical and ethical reasons. From a linguistic perspective, a natural approach is to test the extent to which these models have learned classical linguistic constructs, such as inflectional morphology, constituency structure, agreement between verb and subject, filler-gap dependencies, negative polarity or reflexive anaphora. An influential paper using this approach was presented by Linzen et al. (2016), who investigated the performance of an LSTM-based language model on number agreement.

Semantics, not syntax, creates NLU - Pat Inc - Medium


A scientific hypothesis starts the process of scientific enquiry. False hypotheses can start the path to disaster, as was seen with the geocentric model of the'universe' in which heavenly bodies moved in circular orbits. It became heresy to suggest that orbits aren't circular around the stationary earth, leading to epicycles. It's a good story worth studying in school to appreciate how a hypothesis is critical to validating science. Here's an important hypothesis: "The fundamental aim in the linguistic analysis of a language L is to separate the grammatical sequences which are the sentences of L from the ungrammatical sequences which are not sentences of L and to study the structure of the grammatical sequences."

Anti-efficient encoding in emergent communication Artificial Intelligence

Despite renewed interest in emergent language simulations with neural networks, little is known about the basic properties of the induced code, and how they compare to human language. One fundamental characteristic of the latter, known as Zipf's Law of Abbreviation (ZLA), is that more frequent words are efficiently associated to shorter strings. We study whether the same pattern emerges when two neural networks, a "speaker" and a "listener", are trained to play a signaling game. Surprisingly, we find that networks develop an \emph{anti-efficient} encoding scheme, in which the most frequent inputs are associated to the longest messages, and messages in general are skewed towards the maximum length threshold. This anti-efficient code appears easier to discriminate for the listener, and, unlike in human communication, the speaker does not impose a contrasting least-effort pressure towards brevity. Indeed, when the cost function includes a penalty for longer messages, the resulting message distribution starts respecting ZLA. Our analysis stresses the importance of studying the basic features of emergent communication in a highly controlled setup, to ensure the latter will not strand too far from human language. Moreover, we present a concrete illustration of how different functional pressures can lead to successful communication codes that lack basic properties of human language, thus highlighting the role such pressures play in the latter.

A Neural Network Architecture for Learning Word-Referent Associations in Multiple Contexts Machine Learning

This article proposes a biologically inspired neurocomputational architecture which learns associations between words and referents in different contexts, considering evidence collected from the literature of Psycholinguistics and Neurolinguistics. The multi-layered architecture takes as input raw images of objects (referents) and streams of word's phonemes (labels), builds an adequate representation, recognizes the current context, and associates label with referents incrementally, by employing a Self-Organizing Map which creates new association nodes (prototypes) as required, adjusts the existing prototypes to better represent the input stimuli and removes prototypes that become obsolete/unused. The model takes into account the current context to retrieve the correct meaning of words with multiple meanings. Simulations show that the model can reach up to 78% of word-referent association accuracy in ambiguous situations and approximates well the learning rates of humans as reported by three different authors in five Cross-Situational Word Learning experiments, also displaying similar learning patterns in the different learning conditions.

On the Winograd Schema: Situating Language Understanding in the Data-Information-Knowledge Continuum

AAAI Conferences

The Winograd Schema (WS) challenge, proposed as an alternative to the Turing Test, has become the new standard for evaluating progress in natural language understanding (NLU). In this paper we will not however be concerned with how this challenge might be addressed. Instead, our aim here is threefold: (i) we will fir st formally „situate‟ the WS challenge in the data-information-knowledge continuum, suggesting where in that continuum a good WS resides; (ii) we will show that a WS is just a special case of a more general phenomenon in language understanding, namely the missing text phenomenon (henceforth, MTP) - in particular, we will argue that what we usually call thinking in the process of language understanding involves discovering a significant amount of „missing text‟ - text that is not explicitly stated, but is often implicitly assumed as shared background knowledge; and (iii) we conclude with a brief discussion on why MTP is inconsistent with the data-driven and machine learning approach to language understanding.

Synergies in learning words and their referents

Neural Information Processing Systems

This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. The models themselves are novel kinds of Adaptor Grammars that are an extension of an embedding of topic models into PCFGs. These models simultaneously segment phoneme sequences into words and learn the relationship between non-linguistic objects to the words that refer to them. We show (i) that modelling inter-word dependencies not only improves the accuracy of the word segmentation but also of word-object relationships, and (ii) that a model that simultaneously learns word-object relationships and word segmentation segments more accurately than one that just learns word segmentation on its own. We argue that these results support an interactive view of language acquisition that can take advantage of synergies such as these.

Computational and Robotic Models of Early Language Development: A Review Artificial Intelligence

Abstract: We review computational and robotics models of early language learning and development. We first explain why and how these models are used to understand better how children learn language. We argue that they provide concrete theories of language learning as a complex dynamic system, complementing traditional methods in psychology and linguistics. We review different modeling formalisms, grounded in techniques from machine learning and artificial intelligence such as Bayesian and neural network approaches. We then discuss their role in understanding several key mechanisms of language development: cross-situational statistical learning, embodiment, situated social interaction, intrinsically motivated learning, and cultural evolution. We conclude by discussing future challenges for research, including modeling of large-scale empirical data about language acquisition in real-world environments. Language involves a multitude of components interacting in complex ways in parallel ...

On the Winograd Schema Challenge: Levels of Language Understanding and the Phenomenon of the Missing Text Artificial Intelligence

The Winograd Schema (WS) challenge has been proposed as an alternative to the Turing Test as a test for machine intelligence. In this short paper we "situate" the WS challenge in the data-information-knowledge continuum, suggesting in the process what a good WS is. Furthermore, we suggest that the WS is a special case of a more general phenomenon in language understanding, namely the phenomenon of the "missing text". In particular, we will argue that what we usually call thinking in the process of language understanding almost always involves discovering the missing text - text is rarely explicitly stated but is implicitly assumed as shared background knowledge. We therefore suggest extending the WS challenge to include tests beyond those involving reference resolution, including examples that require discovering the missing text in situations that are usually treated in computational linguistics under different labels, such as metonymy, quantifier scope ambiguity, lexical disambiguation, and co-predication, to name a few.

The Computational Metaphor and Artificial Intelligence: A Reflective Examination of a Theoretical Falsework

AI Magazine

AI. Specifically, we address three Just how little can be illustrated by the reaction to Winograd and Flores's (1986) recent book Understanding Computers and Cognition. In personal comments, the book and its authors have been savaged. Published comments are, of course, more temperate (Vellino et al. 1987) but still reveal the hypersensitivity of the Penrose's (1989) even more recent book The Emperor's New Mind have been observed. Like Suchman (1987) and Clancey (1987), we feel that insights of significant value are to be gained from an objective consideration of traditional and alternative perspectives. Some efforts in this direction are evident (Haugeland [1985], Hill [1989], and Born [1987], for example), but the issue requires additional and ongoing attention.

Modeling Semantic Expectation: Using Script Knowledge for Referent Prediction Machine Learning

Recent research in psycholinguistics has provided increasing evidence that humans predict upcoming content. Prediction also affects perception and might be a key to robustness in human language processing. In this paper, we investigate the factors that affect human prediction by building a computational model that can predict upcoming discourse referents based on linguistic knowledge alone vs. linguistic knowledge jointly with common-sense knowledge in the form of scripts. We find that script knowledge significantly improves model estimates of human predictions. In a second study, we test the highly controversial hypothesis that predictability influences referring expression type but do not find evidence for such an effect.