Reviews: Kernelized Bayesian Softmax for Text Generation

Neural Information Processing Systems 

This paper builds on the observation that context vectors from a language model, such as BERT, often cluster into separate groups for the same next word. These clusters may correspond to different senses of the word, and the clusters often have differing variances. The authors argue that a traditional softmax is not expressive enough to capture this structure. A similar argument was made by Yang et al. in their Mixture of Softmaxes (MoS) paper. The solution presented here is quite different though -- allocate multiple senses to each word in the output embedding table, and use a parameterized kernel to model the variance. The ideas are pretty neat, and as far as I know, original.
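To make the core idea concrete for other reviewers: the paper's mechanism amounts to scoring a context vector against several sense embeddings per word, with a per-sense kernel parameter absorbing cluster variance, then aggregating senses into a single word probability. The sketch below is my own minimal illustration, not the authors' implementation; the exponential kernel, the `theta` bandwidth parameter, and the log-sum-exp aggregation over senses are assumptions chosen for clarity.

```python
import numpy as np

def kernelized_softmax(context, sense_emb, theta):
    """Hypothetical sketch of a multi-sense, kernelized softmax.

    context:   (d,)      context vector from the language model
    sense_emb: (V, S, d) S sense embeddings for each of V words (assumed layout)
    theta:     (V, S)    per-sense kernel parameters modelling cluster variance
    Returns:   (V,)      probability distribution over the vocabulary
    """
    # Kernel score for every sense: a scaled inner product, standing in
    # for the paper's parameterized kernel (assumption).
    dots = np.einsum("vsd,d->vs", sense_emb, context)  # (V, S)
    scores = theta * dots

    # Aggregate each word's senses with a numerically stable log-sum-exp.
    m = scores.max(axis=1, keepdims=True)
    word_logits = (m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))).squeeze(1)

    # Ordinary softmax over words.
    probs = np.exp(word_logits - word_logits.max())
    return probs / probs.sum()
```

Under this view, the usual softmax is the special case of one sense per word and a fixed linear kernel, which is why the proposal is strictly more expressive.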