Neural network-based generative language models like ELMo and BERT can work effectively as general purpose sentence encoders in text classification without further fine-tuning. Is it possible to adapt them in a similar way for use as general-purpose decoders? For this to be possible, it would need to be the case that for any target sentence of interest, there is some continuous representation that can be passed to the language model to cause it to reproduce that sentence. We set aside the difficult problem of designing an encoder that can produce such representations and, instead, ask directly whether such representations exist at all. To do this, we introduce a pair of effective, complementary methods for feeding representations into pretrained unconditional language models and a corresponding set of methods to map sentences into and out of this representation space, the reparametrized sentence space.
Under the leadership of the Fraunhofer Institutes for Intelligent Analysis and Information Systems (IAIS) and for Integrated Circuits (IIS), the OpenGPT-X project is starting with the goal of developing a large AI language model for Europe. Particular attention is being paid to data protection as well as European language diversity. "International competitors have already recognized the enormous disruptive potential of AI language technologies for business, industry and society. A European AI language model like OpenGPT-X is therefore imperative to ensure Europe's digital sovereignty and market independence," says Dr. Nicolas Flores-Herr, head of the project at Fraunhofer IAIS. Due to the high technical requirements, such as computing power, such powerful language models can so far only be implemented by large companies or consortia.
Most modern NLP systems follow a fairly standard recipe when training models for new use cases: first pre-train, then fine-tune. The goal of pre-training is to leverage large amounts of unlabeled text to build a general model of language understanding, which is then fine-tuned on specific NLP tasks such as machine translation or text summarization. In this blog, we will discuss two popular pre-training schemes: Masked Language Modeling (MLM) and Causal Language Modeling (CLM). Under Masked Language Modeling, we typically mask a certain percentage of the words in a given sentence, and the model is expected to predict those masked words from the other words in that sentence. This training scheme makes the model bidirectional in nature, because the representation of a masked word is learned from the words that occur to its left as well as to its right.
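To make the masking step concrete, here is a minimal sketch in plain Python. It only builds the masked input and the prediction targets; it does not train a model. The helper name `mask_tokens`, the `[MASK]` placeholder string, and the default fraction of 15% are assumptions chosen for illustration, not something prescribed in this post.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol; the exact string is an assumption


def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Replace a random fraction of tokens with MASK_TOKEN.

    Returns the masked sequence and a dict mapping each masked position
    to the original token that a model would be asked to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token so that short sentences still yield a target.
    n_to_mask = min(len(tokens), max(1, round(len(tokens) * mask_fraction)))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, targets


sentence = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(sentence)
print(masked)   # the sentence with one token replaced by [MASK]
print(targets)  # {position: original_token}
```

Note that the unmasked tokens on both sides of each `[MASK]` remain visible in the input, which is what allows the model to use left and right context when predicting the hidden word.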
"What do data-rich models know that models with less pre-training data do not?" The performance of language models is determined mostly by the amount of training data, the quality of that data, and the choice of modelling technique used for estimation. Pretrained language models like BERT use massive datasets on the order of tens or even hundreds of billions of words to learn linguistic features and world knowledge, and they can be fine-tuned to achieve good performance on many downstream tasks. General-purpose pretrained language models achieve strong performance on NLU tasks through pretraining on billions of words. But what exact knowledge, ask the researchers at NYU, do these models learn from large-scale pretraining that they cannot learn from less data? To understand the relation between the scale of the data and what language models learn, the researchers adopted four probing methods (classifier probing, information-theoretic probing, unsupervised relative acceptability judgment, and fine-tuning on NLU tasks) and plotted learning curves (shown above) for each of the four methods.
Language models are a key building block in modern Natural Language Processing. Mathematically, a language model is a probability distribution over words and sequences of words. In recent years, there has been extensive work in the NLP community towards building models for every language, such as German, Spanish, and even Esperanto, as you can see in the tutorial that inspired this article. One can think of a language model as a mapping between the numerical representation of words and a latent space. In the general case (English, Spanish, …), we should expect language models to be broader, as they try to create a latent space able to capture all the different nuances of a language (elevated, formal, mundane and even vulgar).
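As a toy illustration of "a probability distribution over sequences of words", the sketch below estimates a bigram model from three short sentences and scores whole sentences with it. A bigram count model is a deliberately simple stand-in chosen for this example, not the kind of neural model discussed in the article; the function names, the `<s>` and `</s>` boundary markers, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def train_bigram_model(corpus):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        # <s> and </s> mark sentence start and end (illustrative markers).
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, curr)] += 1
    return context_counts, bigram_counts


def sequence_probability(sentence, context_counts, bigram_counts):
    """P(sentence) as a product of P(word | previous word), no smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigram_counts[(prev, curr)] / context_counts[prev]
    return prob


toy_corpus = ["the cat sat", "the dog sat", "the cat ran"]
context_counts, bigram_counts = train_bigram_model(toy_corpus)

p_seen = sequence_probability("the cat sat", context_counts, bigram_counts)
p_unseen = sequence_probability("cat the sat", context_counts, bigram_counts)
print(p_seen, p_unseen)
```

On this corpus the word order seen in training receives a positive probability, while the scrambled order receives zero because no training sentence starts with "cat". Without smoothing, any sentence containing an unseen bigram gets probability zero, which is one reason practical models rely on smoothing or learned representations.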