Dyer, Chris
Learning and Evaluating General Linguistic Intelligence
Yogatama, Dani, d'Autume, Cyprien de Masson, Connor, Jerome, Kocisky, Tomas, Chrzanowski, Mike, Kong, Lingpeng, Lazaridou, Angeliki, Ling, Wang, Yu, Lei, Dyer, Chris, Blunsom, Phil
Advances in deep learning techniques (e.g., attention mechanisms, memory modules, and architecture search) have considerably improved natural language processing (NLP) models on many important tasks. For example, machine performance on both Chinese-English machine translation and document question answering on the Stanford question answering dataset (SQuAD; Rajpurkar et al., 2016) has been claimed to surpass human levels (Hassan et al., 2018; Devlin et al., 2018). While the tasks that motivated the development of learning-based NLP models were driven by external demands and remain important applications in their own right (e.g., machine translation, question answering, automatic speech recognition, and text to speech), there is a marked and troubling tendency for recent datasets to be constructed so that they can be solved with little generalization or abstraction; for instance, ever larger datasets are created by crowd-sourcing processes that may not well approximate the natural distributions they are intended to span, although there are some notable counterexamples (Kwiatkowski et al., 2019). When multiple datasets represent the same task across different domains (e.g., the various question answering datasets), we rarely evaluate on all of them. This state of affairs promotes the development of models that only work well for a specific purpose, overestimates our success at having solved the general task, fails to reward sample-efficient generalization that requires the ability to discover and exploit rich linguistic structure, and ultimately limits progress.
Neural Arithmetic Logic Units
Trask, Andrew, Hill, Felix, Reed, Scott E., Rae, Jack, Dyer, Chris, Blunsom, Phil
Neural networks can learn to represent and manipulate numerical information, but they seldom generalize well outside of the range of numerical values encountered during training. To encourage more systematic numerical extrapolation, we propose an architecture that represents numerical quantities as linear activations which are manipulated using primitive arithmetic operators, controlled by learned gates. We call this module a neural arithmetic logic unit (NALU), by analogy to the arithmetic logic unit in traditional processors. Experiments show that NALU-enhanced neural networks can learn to track time, perform arithmetic over images of numbers, translate numerical language into real-valued scalars, execute computer code, and count objects in images. In contrast to conventional architectures, we obtain substantially better generalization both inside and outside of the range of numerical values encountered during training, often extrapolating orders of magnitude beyond trained numerical ranges.
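For intuition, here is a minimal PyTorch-style sketch of a NALU cell as described above: a shared weight matrix constrained towards {-1, 0, 1} drives an additive path and a log-space multiplicative path, and a learned sigmoid gate mixes the two. Dimension names and initialisation are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class NALU(nn.Module):
    """Minimal sketch of a neural arithmetic logic unit (NALU) cell."""
    def __init__(self, in_dim, out_dim, eps=1e-7):
        super().__init__()
        self.W_hat = nn.Parameter(torch.empty(out_dim, in_dim))
        self.M_hat = nn.Parameter(torch.empty(out_dim, in_dim))
        self.G = nn.Parameter(torch.empty(out_dim, in_dim))
        for p in (self.W_hat, self.M_hat, self.G):
            nn.init.xavier_uniform_(p)
        self.eps = eps

    def forward(self, x):
        # Weights pushed towards {-1, 0, 1} so the unit learns to select inputs.
        W = torch.tanh(self.W_hat) * torch.sigmoid(self.M_hat)
        add = x @ W.t()                                         # addition / subtraction path
        mul = torch.exp(torch.log(x.abs() + self.eps) @ W.t())  # log-space multiplication / division path
        g = torch.sigmoid(x @ self.G.t())                       # learned gate between the two paths
        return g * add + (1 - g) * mul
```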
Unsupervised Text Style Transfer using Language Models as Discriminators
Yang, Zichao, Hu, Zhiting, Dyer, Chris, Xing, Eric P., Berg-Kirkpatrick, Taylor
Binary classifiers are often employed as discriminators in GAN-based unsupervised style transfer models to ensure that transferred sentences are similar to sentences in the target domain. One difficulty with a binary discriminator is that its error signal is sometimes insufficient to train the model to produce richly structured language. In this paper, we propose using a target domain language model as the discriminator to provide richer, token-level feedback during the learning process. Because the language model scores sentences directly using a product of locally normalized probabilities, it offers a more stable and more useful training signal to the generator. We train the generator to minimize the negative log likelihood (NLL) of generated sentences as evaluated by the language model. By using a continuous approximation of the discrete samples, our model can be trained with back-propagation in an end-to-end fashion. Moreover, we find empirically that, with a language model as a structured discriminator, it is possible to eliminate the adversarial training steps that use negative samples, making training more stable. We compare our model with previous work that uses convolutional neural networks (CNNs) as discriminators and show that our model significantly outperforms them on three tasks: word substitution decipherment, sentiment modification, and related language translation.
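As a rough illustration of the training signal described above, the sketch below scores soft (continuous) generator outputs with a target-domain language model and returns the expected NLL, which can be back-propagated into the generator. The `lm.embedding` and `lm.step` interfaces are assumptions made for this sketch, not the paper's actual API.

```python
import torch
import torch.nn.functional as F

def lm_discriminator_loss(soft_tokens, lm, hidden=None):
    """Expected NLL of soft generator outputs under a target-domain LM.

    soft_tokens: (batch, T, vocab) softmax distributions from the generator.
    lm: assumed to expose an embedding matrix `lm.embedding.weight` and a
        step function `lm.step(emb_t, hidden) -> (logits, hidden)`.
    """
    batch, T, vocab = soft_tokens.shape
    nll = soft_tokens.new_zeros(batch)
    # Expected embedding of each soft token keeps the computation differentiable.
    emb = soft_tokens @ lm.embedding.weight            # (batch, T, emb_dim)
    for t in range(T - 1):
        logits, hidden = lm.step(emb[:, t], hidden)    # LM predicts the next token
        log_probs = F.log_softmax(logits, dim=-1)
        # Expected NLL of the generator's next soft token under the LM.
        nll = nll - (soft_tokens[:, t + 1] * log_probs).sum(-1)
    return nll.mean()
```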
Sentence Encoding with Tree-constrained Relation Networks
Yu, Lei, d'Autume, Cyprien de Masson, Dyer, Chris, Blunsom, Phil, Kong, Lingpeng, Ling, Wang
The meaning of a sentence is a function of the relations that hold between its words. We instantiate this relational view of semantics in a series of neural models based on variants of relation networks (RNs) which represent a set of objects (for us, words forming a sentence) in terms of representations of pairs of objects. We propose two extensions to the basic RN model for natural language. First, building on the intuition that not all word pairs are equally informative about the meaning of a sentence, we use constraints based on both supervised and unsupervised dependency syntax to control which relations influence the representation. Second, since higher-order relations are poorly captured by a sum of pairwise relations, we use a recurrent extension of RNs to propagate information so as to form representations of higher order relations. Experiments on sentence classification, sentence pair classification, and machine translation reveal that, while basic RNs are only modestly effective for sentence representation, recurrent RNs with latent syntax are a reliably powerful representational device.
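A minimal sketch of the basic pairwise RN over word vectors follows, with an optional 0/1 mask standing in for the syntactic constraint; the recurrent, higher-order extension is not shown, and names and dimensions are illustrative rather than the paper's.

```python
import torch
import torch.nn as nn

class SentenceRN(nn.Module):
    """Basic relation network over word vectors: the sentence representation
    is the sum of an MLP applied to all ordered word pairs, optionally
    restricted by a 0/1 `pair_mask` (e.g. derived from dependency arcs)."""
    def __init__(self, word_dim, hidden_dim, out_dim):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * word_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim))

    def forward(self, words, pair_mask=None):
        # words: (T, word_dim); build representations of all (T, T) word pairs.
        T, d = words.shape
        left = words.unsqueeze(1).expand(T, T, d)
        right = words.unsqueeze(0).expand(T, T, d)
        pair_reps = self.g(torch.cat([left, right], dim=-1))  # (T, T, out_dim)
        if pair_mask is not None:                              # keep only licensed pairs
            pair_reps = pair_reps * pair_mask.unsqueeze(-1)
        return pair_reps.sum(dim=(0, 1))                       # sentence vector
```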
Dynamic Integration of Background Knowledge in Neural NLU Systems
Weissenborn, Dirk, Kočiský, Tomáš, Dyer, Chris
Common-sense and background knowledge are required to understand natural language, but in most neural natural language understanding (NLU) systems this knowledge must be acquired from training corpora during learning and then remains static at test time. We introduce a new architecture for the dynamic integration of explicit background knowledge in NLU models. A general-purpose reading module reads background knowledge in the form of free-text statements (together with task-specific text inputs) and yields refined word representations to a task-specific NLU architecture that reprocesses the task inputs with these representations. Experiments on document question answering (DQA) and recognizing textual entailment (RTE) demonstrate the effectiveness and flexibility of the approach. Analysis shows that our model learns to exploit knowledge in a semantically appropriate way.
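The following is a loose, hypothetical sketch of the idea: task-input word vectors attend over encoded background statements, and a learned gate mixes the attended summary into refined word representations that a downstream NLU model would then reprocess. Module and interface names are illustrative, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeRefiner(nn.Module):
    """Hypothetical sketch of dynamic background-knowledge integration."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, task_words, knowledge_words):
        # task_words: (T, dim); knowledge_words: (K, dim) encoded free-text statements
        scores = task_words @ knowledge_words.t()      # (T, K) attention scores
        attn = F.softmax(scores, dim=-1)
        summary = attn @ knowledge_words               # (T, dim) knowledge summary per word
        combined = torch.cat([task_words, summary], dim=-1)
        g = torch.sigmoid(self.gate(combined))         # how much knowledge to mix in
        return g * torch.tanh(self.proj(combined)) + (1 - g) * task_words
```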
Greedy, Joint Syntactic-Semantic Parsing with Stack LSTMs
Swayamdipta, Swabha, Ballesteros, Miguel, Dyer, Chris, Smith, Noah A.
We present a transition-based parser that jointly produces syntactic and semantic dependencies. It learns a representation of the entire algorithm state using stack long short-term memories (LSTMs). Our greedy inference algorithm runs in linear time, including feature extraction. On the CoNLL 2008–2009 English shared tasks, we obtain the best published parsing performance among models that jointly learn syntax and semantics.
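For readers unfamiliar with the stack LSTM primitive, here is a minimal sketch of its push/pop interface, used in such parsers to summarise the stack and buffer; the full parser and its transition system are not shown, and names are illustrative.

```python
import torch
import torch.nn as nn

class StackLSTM(nn.Module):
    """Sketch of a stack LSTM: an LSTM whose state history is kept on a stack
    so that pop() rewinds the summary to the previous state."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.cell = nn.LSTMCell(in_dim, hidden_dim)
        zero = (torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim))
        self.states = [zero]                 # history of (h, c) pairs; index 0 = empty stack

    def push(self, x):                       # x: (1, in_dim) embedding of the pushed item
        h, c = self.cell(x, self.states[-1])
        self.states.append((h, c))

    def pop(self):                           # discard the top item and its state
        if len(self.states) > 1:
            self.states.pop()

    def summary(self):                       # current representation of the whole stack
        return self.states[-1][0]
```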
Relational inductive biases, deep learning, and graph networks
Battaglia, Peter W., Hamrick, Jessica B., Bapst, Victor, Sanchez-Gonzalez, Alvaro, Zambaldi, Vinicius, Malinowski, Mateusz, Tacchetti, Andrea, Raposo, David, Santoro, Adam, Faulkner, Ryan, Gulcehre, Caglar, Song, Francis, Ballard, Andrew, Gilmer, Justin, Dahl, George, Vaswani, Ashish, Allen, Kelsey, Nash, Charles, Langston, Victoria, Dyer, Chris, Heess, Nicolas, Wierstra, Daan, Kohli, Pushmeet, Botvinick, Matt, Vinyals, Oriol, Li, Yujia, Pascanu, Razvan
Artificial intelligence (AI) has undergone a renaissance recently, making major progress in key domains such as vision, language, control, and decision-making. This has been due, in part, to cheap data and cheap compute resources, which have fit the natural strengths of deep learning. However, many defining characteristics of human intelligence, which developed under much different pressures, remain out of reach for current approaches. In particular, generalizing beyond one's experiences--a hallmark of human intelligence from infancy--remains a formidable challenge for modern AI. The following is part position paper, part review, and part unification. We argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and that structured representations and computations are key to realizing this objective. Just as biology uses nature and nurture cooperatively, we reject the false choice between "hand-engineering" and "end-to-end" learning, and instead advocate for an approach which benefits from their complementary strengths. We explore how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and rules for composing them. We present a new building block for the AI toolkit with a strong relational inductive bias--the graph network--which generalizes and extends various approaches for neural networks that operate on graphs, and provides a straightforward interface for manipulating structured knowledge and producing structured behaviors. We discuss how graph networks can support relational reasoning and combinatorial generalization, laying the foundation for more sophisticated, interpretable, and flexible patterns of reasoning.
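A minimal sketch of a single GN block with summed aggregation, in the spirit of the paper's formulation: per-edge, per-node, and global update functions. All feature dimensions are kept equal for brevity; this is not the reference implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                         nn.Linear(out_dim, out_dim))

class GNBlock(nn.Module):
    """Sketch of a graph network block: edge, node, and global updates."""
    def __init__(self, dim):
        super().__init__()
        self.edge_fn = mlp(4 * dim, dim)     # edge, sender, receiver, global
        self.node_fn = mlp(3 * dim, dim)     # aggregated incoming edges, node, global
        self.global_fn = mlp(3 * dim, dim)   # aggregated edges, aggregated nodes, global

    def forward(self, nodes, edges, senders, receivers, u):
        # nodes: (N, dim); edges: (E, dim); senders/receivers: (E,) long indices; u: (dim,)
        uE = u.expand(edges.size(0), -1)
        new_edges = self.edge_fn(torch.cat(
            [edges, nodes[senders], nodes[receivers], uE], dim=-1))
        # Sum updated edges into their receiver nodes.
        agg = torch.zeros_like(nodes).index_add_(0, receivers, new_edges)
        uN = u.expand(nodes.size(0), -1)
        new_nodes = self.node_fn(torch.cat([agg, nodes, uN], dim=-1))
        new_u = self.global_fn(torch.cat(
            [new_edges.sum(0), new_nodes.sum(0), u], dim=-1))
        return new_nodes, new_edges, new_u
```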
Pushing the bounds of dropout
Melis, Gábor, Blundell, Charles, Kočiský, Tomáš, Hermann, Karl Moritz, Dyer, Chris, Blunsom, Phil
We show that dropout training is best understood as performing MAP estimation concurrently for a family of conditional models whose objectives are themselves lower bounded by the original dropout objective. This discovery allows us to pick any model from this family after training, which leads to a substantial improvement on regularisation-heavy language modelling. The family includes models that compute a power mean over the sampled dropout masks, and their less stochastic subvariants with tighter and higher lower bounds than the fully stochastic dropout objective. We argue that, since the deterministic subvariant's bound is equal to its objective and is the highest amongst these models, the predominant view of it as a good approximation to MC averaging is misleading. Rather, deterministic dropout is the best available approximation to the true objective.
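As an illustration of the model family described above, the sketch below contrasts a power-mean average over sampled dropout masks with the deterministic subvariant. This is a schematic reading of the abstract, not the paper's code, and the function names are our own.

```python
import torch
import torch.nn.functional as F

def mc_power_mean_predict(model, x, k=10, power=1.0):
    """Average class distributions over k sampled dropout masks with a power
    mean; power=1 is the arithmetic mean, power -> 0 approaches the geometric
    mean of the per-mask distributions."""
    model.train()                            # keep dropout active at prediction time
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(k)])
    if abs(power) < 1e-6:                    # geometric mean via a log-space average
        avg = probs.clamp_min(1e-12).log().mean(0).exp()
    else:
        avg = probs.pow(power).mean(0).pow(1.0 / power)
    return avg / avg.sum(-1, keepdim=True)   # renormalise to a distribution

def deterministic_predict(model, x):
    """Deterministic subvariant: dropout replaced by its expectation."""
    model.eval()                             # inverted dropout: activations equal their expectation
    return F.softmax(model(x), dim=-1)
```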