Collaborating Authors

Understanding Masked Language Models (MLM) and Causal Language Models (CLM) in NLP


Most of the modern-day NLP systems have been following a pretty standard approach for training new models for various use-cases and that is First Pre-train then Fine-tune. Here, the goal of pre-training is to leverage large amounts of unlabeled text and build a general model of language understanding before being fine-tuned on various specific NLP tasks such as machine translation, text summarization, etc. In this blog, we will discuss two popular pre-training schemes, namely, Masked Language Modeling (MLM) and Causal Language Modeling (CLM). Under Masked Language Modelling, we typically mask a certain % of words in a given sentence and the model is expected to predict those masked words based on other words in that sentence. Such a training scheme makes this model bidirectional in nature because the representation of the masked word is learnt based on the words that occur it's left as well as right.

Machine Learning: Natural Language Processing in Python (V2)


Welcome to Machine Learning: Natural Language Processing in Python (Version 2). In part 1, which covers vector models and text preprocessing methods, you will learn about why vectors are so essential in data science and artificial intelligence. You will learn about various techniques for converting text into vectors, such as the CountVectorizer and TF-IDF, and you'll learn the basics of neural embedding methods like word2vec, and GloVe. You'll then apply what you learned for various tasks, such as: Along the way, you'll also learn important text preprocessing steps, such as tokenization, stemming, and lemmatization. You'll be introduced briefly to classic NLP tasks such as parts-of-speech tagging.

Topical Language Generation using Transformers Artificial Intelligence

Large-scale transformer-based language models (LMs) demonstrate impressive capabilities in open text generation. However, controlling the generated text's properties such as the topic, style, and sentiment is challenging and often requires significant changes to the model architecture or retraining and fine-tuning the model on new supervised data. This paper presents a novel approach for Topical Language Generation (TLG) by combining a pre-trained LM with topic modeling information. We cast the problem using Bayesian probability formulation with topic probabilities as a prior, LM probabilities as the likelihood, and topical language generation probability as the posterior. In learning the model, we derive the topic probability distribution from the user-provided document's natural structure. Furthermore, we extend our model by introducing new parameters and functions to influence the quantity of the topical features presented in the generated text. This feature would allow us to easily control the topical properties of the generated text. Our experimental results demonstrate that our model outperforms the state-of-the-art results on coherency, diversity, and fluency while being faster in decoding.

Multiplicative Models for Recurrent Language Modeling Machine Learning

Recently, there has been interest in multiplicative recurrent neural networks for language modeling. Indeed, simple Recurrent Neural Networks (RNNs) encounter difficulties recovering from past mistakes when generating sequences due to high correlation between hidden states. These challenges can be mitigated by integrating second-order terms in the hidden-state update. One such model, multiplicative Long Short-Term Memory (mLSTM) is particularly interesting in its original formulation because of the sharing of its second-order term, referred to as the intermediate state. We explore these architectural improvements by introducing new models and testing them on character-level language modeling tasks. This allows us to establish the relevance of shared parametrization in recurrent language modeling.

A brief timeline of NLP from Bag of Words to the Transformer family


Bag of Words (BOW) [1954]: count the occurrences of each word in the documents and use them as features. TF-IDF [1972]: the BOW scores are modified so that rare words have high scores and common words have low scores. Word2Vec [2013]: each word is mapped to a high-dimensional vector called word embedding, which captures its semantic. Word embeddings are learned by a neural network looking for word correlations on a large corpus. RNN [1986]: RNNs compute document embeddings leveraging word context in sentences, which was not possible with word embeddings alone.