A language model is a function, or an algorithm for learning such a function, that captures the salient statistical characteristics of the distribution of sequences of words in a natural language, typically allowing one to make probabilistic predictions of the next word given preceding ones. A neural network language model is a language model based on Neural Networks, exploiting their ability to learn distributed representations to reduce the impact of the curse of dimensionality. In the context of learning algorithms, the curse of dimensionality refers to the need for huge numbers of training examples when learning highly complex functions. When the number of input variables increases, the number of required examples can grow exponentially. The curse of dimensionality arises when a huge number of different combinations of values of the input variables must be discriminated from each other, and the learning algorithm needs at least one example per relevant combination of values.
Undirected neural sequence models such as BERT [Devlin et al., 2019] have received renewed interest due to their success on discriminative natural language understanding tasks such as question-answering and natural language inference. The problem of generating sequences directly from these models has received relatively little attention, in part because generating from such models departs significantly from the conventional approach of monotonic generation in directed sequence models. We investigate this problem by first proposing a generalized model of sequence generation that unifies decoding in directed and undirected models. The proposed framework models the process of generation rather than a resulting sequence, and under this framework, we derive various neural sequence models as special cases, such as autoregressive, semi-autoregressive, and refinement-based non-autoregressive models. This unification enables us to adapt decoding algorithms originally developed for directed sequence models to undirected models. We demonstrate this by evaluating various decoding strategies for the recently proposed cross-lingual masked translation model [Lample and Conneau, 2019]. Our experiments reveal that generation from undirected sequence models, under our framework, is competitive with the state of the art on WMT'14 English-German translation. We furthermore observe that the proposed approach enables constant-time translation while remaining within 1 BLEU score compared to linear-time translation from the same undirected neural sequence model.
In this week's post we will have a closer look at a paper dealing with the modeling of style, topic and high-level syntactic structures in language models by introducing global distributed latent representations. In particular, the variational autoencoder seems to be a promising candidate for pushing generative language models forwards and including global features. Recurrent neural network language models are known to be capable of modeling complex distributions over sequences. However, their architecture limits them to modeling local statistics over sequences and therefore global features have to be captured otherwise. Non-generative language models include the standard recurrent neural network language model, which predicts words depending on previous seen words and does not learn a global vector representation of the sequence at any time.
Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we present a single sequence-to-sequence ASR model trained on 9 different Indian languages, which have very little overlap in their scripts. Specifically, we take a union of language-specific grapheme sets and train a grapheme-based sequence-to-sequence model jointly on data from all languages. We find that this model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually. By modifying the model to accept a language identifier as an additional input feature, we further improve performance by an additional 7% relative and eliminate confusion between different languages.
Recent advances in conditional recurrent language modelling have mainly focused on network architectures (e.g., attention mechanism), learning algorithms (e.g., scheduled sampling and sequence-level training) and novel applications (e.g., image/video description generation, speech recognition, etc.) On the other hand, we notice that decoding algorithms/strategies have not been investigated as much, and it has become standard to use greedy or beam search. In this paper, we propose a novel decoding strategy motivated by an earlier observation that nonlinear hidden layers of a deep neural network stretch the data manifold. The proposed strategy is embarrassingly parallelizable without any communication overhead, while improving an existing decoding algorithm. We extensively evaluate it with attention-based neural machine translation on the task of En->Cz translation.