Reviews: Wasserstein Training of Restricted Boltzmann Machines

Neural Information Processing Systems 

The paper cites [2] as proving statistical consistency of the Wasserstein estimator. I am therefore surprised to see, in Section 4.3, that non-zero shrinkage is obtained (including at gamma = 0) even in the very simple case of modelling an N(0, I) distribution with N(0, sigma^2 I). What is going on here? An actual failure of consistency would be a serious flaw in the formulation of a statistical learning criterion, so the authors should reconcile the two results explicitly; presumably the shrinkage is a finite-sample bias of the sample-based Wasserstein objective rather than an asymptotic inconsistency, but the paper should say so (see the sketch below, which reproduces the effect in a minimal setting).

Also, in Section 3 (Stability and KL regularization), the authors state that, at least for learning based on samples (\hat{p}_theta), some regularization with respect to the KL divergence is required. This clearly weakens the "purity" of the smoothed Wasserstein objective function: the distance cannot stand alone as a training criterion.
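To make the shrinkage concern concrete, here is a minimal sketch of my own (1-D, exact W2 computed via quantile functions, rather than the paper's d-dimensional smoothed setting; all names and parameters are illustrative). The sigma minimizing the expected Wasserstein distance between a finite sample from N(0, 1) and the model N(0, sigma^2) comes out well below 1:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, trials = 10, 1000           # small sample size makes the bias visible

# 1-D identity: W2^2(p, q) = int_0^1 (F_p^{-1}(u) - F_q^{-1}(u))^2 du,
# approximated on a fine grid of quantile levels u.
u = (np.arange(2001) + 0.5) / 2001
z = norm.ppf(u)                # quantile function of the data N(0, 1)

sigmas = np.linspace(0.5, 1.2, 71)        # candidate model scales
w2sq = np.zeros_like(sigmas)
for _ in range(trials):
    x = np.sort(rng.standard_normal(n))   # one data sample from N(0, 1)
    xq = x[np.minimum((u * n).astype(int), n - 1)]  # empirical quantile fn
    # model quantile function is sigma * z; accumulate W2^2 per candidate
    w2sq += np.mean((xq[None, :] - sigmas[:, None] * z[None, :]) ** 2, axis=1)

best = sigmas[np.argmin(w2sq)]
print(f"sigma minimizing expected empirical W2^2: {best:.2f}")  # ~0.87, not 1
```

Note this uses the exact 1-D Wasserstein distance with no entropic smoothing at all, which is why it bears directly on the gamma = 0 case: the shrinkage appears to come from the finite sample, not from the smoothing.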