A language model is a function, or an algorithm for learning such a function, that captures the salient statistical characteristics of the distribution of sequences of words in a natural language, typically allowing one to make probabilistic predictions of the next word given the preceding ones. A neural network language model is a language model based on neural networks, exploiting their ability to learn distributed representations to reduce the impact of the curse of dimensionality. In the context of learning algorithms, the curse of dimensionality refers to the need for a huge number of training examples when learning highly complex functions: as the number of input variables grows, the number of required examples can grow exponentially. The problem arises when a huge number of different combinations of values of the input variables must be discriminated from each other, and the learning algorithm needs at least one example per relevant combination of values.
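As a toy illustration of the "predict the next word given preceding ones" view, the sketch below stores a hypothetical table of conditional probabilities and returns the most probable next word. The contexts, words, and probability values are invented purely for the example, not drawn from any real data.

```python
# Toy next-word prediction from a hand-written conditional table.
# All probabilities here are invented for illustration only.
cond_prob = {
    ("have", "a"): {"great": 0.5, "nice": 0.3, "seat": 0.2},
    ("a", "great"): {"day": 0.6, "time": 0.3, "fall": 0.1},
}

def predict_next(context):
    """Return the most probable next word after a two-word context."""
    dist = cond_prob[tuple(context)]
    return max(dist, key=dist.get)

print(predict_next(["a", "great"]))  # -> day
```

A real language model differs only in scale: it estimates such conditional distributions for every context it may encounter, which is exactly where the curse of dimensionality bites.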
We propose several ways of reusing subword embeddings and other weights in subword-aware neural language models. The proposed techniques do not benefit a competitive character-aware model, but some of them improve the performance of syllable- and morpheme-aware models while showing significant reductions in model sizes. We discover a simple hands-on principle: in a multi-layer input embedding model, layers should be tied consecutively bottom-up if reused at output. Our best morpheme-aware model with properly reused weights beats the competitive word-level model by a large margin across multiple languages and has 20%-87% fewer parameters.
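The core idea of reusing input-embedding weights at the output can be sketched with plain NumPy. This is a generic weight-tying illustration under assumed toy shapes, not the paper's specific multi-layer subword architecture: the shared matrix `E` embeds input tokens and also serves as the output projection, saving a separate vocab-by-dimension parameter matrix.

```python
import numpy as np

# Sketch of weight tying: the input embedding matrix E is reused as the
# output projection, so logits = h @ E.T instead of learning a separate
# output matrix. Shapes and values below are toy assumptions.
vocab_size, emb_dim = 10, 4
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, emb_dim))  # shared embedding matrix

def embed(word_id):
    return E[word_id]            # input lookup uses E

def output_logits(hidden):
    return hidden @ E.T          # output projection reuses the same E

h = embed(3)                     # pretend the hidden state is the embedding
logits = output_logits(h)
probs = np.exp(logits - logits.max())
probs /= probs.sum()             # softmax over the vocabulary
```

Because `E` appears at both ends of the model, tying removes one `vocab_size × emb_dim` matrix from the parameter count, which is where the reported size reductions come from.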
Simply put, a language model is a statistical model that learns the distribution of words in a sequence, i.e., the probability of each word given the words before it. It turns out that if we can build such a model with high fidelity, we can solve a number of interesting tasks. For example, if we know which word is likely to occur after a given sequence of words, we can implement useful functionality like email autocomplete (e.g., given the sequence "Have a great", we can predict that the next word is likely "day"). When these statistical models are derived using large neural networks with billions of parameters (hence the term large language models, or LLMs), the results and application areas are even more impressive. Results from transformer-based architectures such as BERT and GPT show that these models excel at several complex tasks: they can mimic creative writing, predict sentiment, identify topics within sentences from only a few examples, meaningfully summarize lengthy documents, and translate between languages.
It all starts with a language model. Let's assume we have the sequence [my, cat's, breath, smells, like, cat, ____] and we want to guess the final word. There are several ways to build a language model. The most straightforward is an n-gram model, which counts how often word sequences occur in a corpus and uses those counts to estimate next-word probabilities. A bare-bones implementation requires only a dozen lines of Python code and can be surprisingly powerful.
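A minimal sketch of such a counting model is shown below; it trains a bigram model on the example sentence itself (any larger corpus would work the same way) and predicts the most frequent follower of a word.

```python
from collections import Counter, defaultdict

# Minimal bigram language model: count adjacent word pairs, then
# predict the most frequent follower of a given word.
def train_bigram(tokens):
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict(counts, word):
    followers = counts[word]
    return followers.most_common(1)[0][0] if followers else None

tokens = "my cat's breath smells like cat food".split()
model = train_bigram(tokens)
print(predict(model, "cat"))  # -> food
```

With more training text the counts become meaningful frequency estimates; generalizing from bigrams to n-grams just means keying the counter on the previous n-1 words instead of one.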
Recent work on language modelling has shifted focus from count-based models to neural models. In these works, the words in each sentence are always considered in left-to-right order. In this paper we show how we can improve the performance of the recurrent neural network (RNN) language model by incorporating the syntactic dependencies of a sentence, which have the effect of bringing relevant contexts closer to the word being predicted. We evaluate our approach on the Microsoft Research Sentence Completion Challenge and show that the proposed dependency RNN improves over the baseline RNN by about 10 points in accuracy. Furthermore, we achieve results comparable with the state-of-the-art models on this task.