Reviews: Distilled Wasserstein Learning for Word Embedding and Topic Modeling

Neural Information Processing Systems 

Summary

The authors present a Distilled Wasserstein Learning (DWL) method for simultaneously learning a topic model and a word embedding, using an approach based on the Wasserstein distance applied to elements of finite simplices. This is claimed to be the first method to fit topics and embeddings simultaneously. In particular, their embeddings exploit only document-level co-occurrence rather than nearby co-occurrence within a sequence (i.e. word-order information), as word2vec does. The authors demonstrate the superiority of their embeddings against a variety of benchmark methods on three tasks: mortality prediction, admission-type prediction, and procedure recommendation, using a single corpus of patient admission records in which the words are the International Classification of Diseases (ICD) IDs of procedures and diseases.

There are a number of apparently novel features to their approach, which they outline in their paper, namely:

* It is a topic model in which the observed word frequencies within a document are approximated as the *barycentres* (centres of mass) of a weighted sum over a low-rank basis of topics (where these barycentres are taken with respect to some Wasserstein distance).