Neural network-based generative language models like ELMo and BERT can work effectively as general-purpose sentence encoders in text classification without further fine-tuning. Is it possible to adapt them in a similar way for use as general-purpose decoders? For this to be possible, there would need to be, for any target sentence of interest, some continuous representation that can be passed to the language model to cause it to reproduce that sentence. We set aside the difficult problem of designing an encoder that can produce such representations and instead ask directly whether such representations exist at all. To do this, we introduce a pair of effective, complementary methods for feeding representations into pretrained unconditional language models and a corresponding set of methods to map sentences into and out of this representation space, the reparametrized sentence space.
It is impressive that generative models like OpenAI's GPT-2 can automatically produce text from limited input. But controlling the attributes of that text (topic, context, sentiment) requires an extra layer of work, such as architectural modifications or work with attribute-specific data. A team of researchers from Uber, Caltech, and the Hong Kong University of Science and Technology took on this problem and created the Plug and Play Language Model (PPLM), which combines a pretrained language model with one or more attribute classifiers.
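The idea of steering a fixed language model with an attribute classifier can be illustrated with a toy reweighting of next-token probabilities. The sketch below is not the PPLM algorithm itself, which is more involved; it only shows the Bayes-rule intuition p(x | a) ∝ p(x) p(a | x). The vocabulary, the probabilities, and the function name `steer` are all hypothetical.

```python
def steer(lm_probs, attr_probs, strength=1.0):
    """Reweight next-token probabilities p(x) by p(a | x) ** strength, then renormalize."""
    scores = {tok: p * (attr_probs[tok] ** strength) for tok, p in lm_probs.items()}
    total = sum(scores.values())
    return {tok: s / total for tok, s in scores.items()}

# Hypothetical next-token distribution from an unconditional language model.
lm_probs = {"great": 0.2, "terrible": 0.5, "okay": 0.3}
# Hypothetical classifier output: probability of the "positive" attribute given each token.
attr_probs = {"great": 0.9, "terrible": 0.1, "okay": 0.5}

steered = steer(lm_probs, attr_probs)
best = max(steered, key=steered.get)  # most likely token after steering
```

Setting `strength=0.0` leaves the language model's distribution unchanged, which makes the classifier's influence easy to dial up or down without retraining the language model.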
Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has been demonstrated recently across many NLP domains via a Transformer-based language model with over a billion parameters, verifying the benefits of model size. Model parallelism is required if a model is too large to fit in a single computing device. Current methods for model parallelism either suffer from backward locking in backpropagation or are not applicable to language models. We propose the first model-parallel algorithm that speeds up the training of Transformer-based language models.
Modern Visual Question Answering (VQA) models have been shown to rely heavily on superficial correlations between question and answer words learned during training -- e.g., overwhelmingly reporting the type of room as kitchen or the sport being played as tennis, irrespective of the image. Most alarmingly, this shortcoming is often not well reflected during evaluation because the same strong priors exist in test distributions; however, a VQA system that fails to ground questions in image content would likely perform poorly in real-world settings. In this work, we present a novel regularization scheme for VQA that reduces this effect. We introduce a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed. We then pose training as an adversarial game between the VQA model and this question-only adversary -- discouraging the VQA model from capturing language biases in its question encoding. Further, we leverage this question-only model to estimate the mutual information between the image and answer given the question, which we maximize explicitly to encourage visual grounding.
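The mutual-information term can be read through the identity I(A; I | Q) = H(A | Q) - H(A | I, Q): the image is informative when the full model is more certain about the answer than the question-only model. The following is a minimal sketch, assuming each model outputs a probability distribution over answers. The function names and example distributions are hypothetical, and the paper's actual estimator may differ.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def grounding_gap(question_only_probs, vqa_probs):
    """Per-example proxy for I(A; I | Q): H(A | Q) - H(A | I, Q)."""
    return entropy(question_only_probs) - entropy(vqa_probs)

# Hypothetical answer distributions over ("kitchen", "bedroom", "office").
question_only = [1 / 3, 1 / 3, 1 / 3]  # the question alone leaves the answer uncertain
vqa = [0.05, 0.9, 0.05]                # with the image, the answer is confident

gap = grounding_gap(question_only, vqa)  # positive when the image reduces uncertainty
```

A regularizer built on this quantity would reward a larger gap, so the model is pushed to make predictions that depend on the image and not only on the question.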
"What do data-rich models know that models with less pre-training data do not?" The performance of a language model is determined mostly by the amount of training data, the quality of that data, and the choice of modelling technique. Pretrained language models like BERT use massive datasets, on the order of tens or even hundreds of billions of words, to learn linguistic features and world knowledge, and they can be fine-tuned to achieve strong performance on many downstream NLU tasks. But what exactly, ask researchers at NYU, do these models learn from large-scale pretraining that they cannot learn from less data? To understand the relation between data scale and learning in language models, the researchers adopted four probing methods -- classifier probing, information-theoretic probing, unsupervised relative acceptability judgment, and fine-tuning on NLU tasks -- and plotted learning curves (shown above) for each of the four.
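Classifier probing, the first of the four methods, trains a small classifier on frozen representations and treats its accuracy as evidence of what those representations encode. The following is a minimal sketch, assuming the encoder's features have already been extracted. The 2-D feature vectors, the labels, and the `train_probe` helper are hypothetical stand-ins; the study's actual probes and tasks are not reproduced here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe on frozen features with batch gradient updates."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = y - sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Gradient ascent on the log-likelihood; the features are never updated.
        w = [wi + lr * g / n for wi, g in zip(w, grad_w)]
        b += lr * grad_b / n
    return w, b

def accuracy(w, b, features, labels):
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in features]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical frozen sentence features and a binary linguistic label.
features = [(2.0, 0.5), (1.5, -0.3), (-1.0, 0.2), (-2.0, -0.4)]
labels = [1, 1, 0, 0]

w, b = train_probe(features, labels)
probe_acc = accuracy(w, b, features, labels)
```

In a real probing study, accuracy would be measured on held-out examples and compared across encoders pretrained on different amounts of data, which is what produces a learning curve.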