Review for NeurIPS paper: Bayesian Attention Modules


The improvement over deterministic attention seems marginal on some tasks, such as VQA and machine translation. Unlike prior work on discrete latent variables, which can make interpretability claims, I'm not sure why we would want to model soft attention as a latent variable. Is the stochastic formulation better under low-resource scenarios? Also, can you more thoroughly quantify the benefits of modeling attention uncertainty? Or even qualitatively show a few samples from the attention distribution and check whether they truly reflect the underlying uncertainty.
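The qualitative check requested above could be sketched roughly as follows: draw several samples of stochastic attention weights for one query and inspect their per-token spread. This is a minimal illustration only; the Lognormal reparameterization, the `sigma` noise scale, and the toy scores are assumptions for the sketch, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_attention(scores, sigma=0.5, n_samples=8):
    """Draw stochastic attention weights as renormalized Lognormal samples.

    scores: unnormalized attention logits for one query (shape [n_keys]).
    Returns an array of shape [n_samples, n_keys], each row summing to 1.
    """
    eps = rng.standard_normal((n_samples,) + scores.shape)
    unnorm = np.exp(scores + sigma * eps)  # Lognormal: exp of a Gaussian
    return unnorm / unnorm.sum(axis=-1, keepdims=True)

# Toy query-key scores (hypothetical values for illustration).
scores = np.array([2.0, 0.5, 0.1, -1.0])
samples = sample_attention(scores)

# Per-token mean and standard deviation across samples: a large std
# relative to the mean flags tokens whose attention is uncertain.
mean, std = samples.mean(axis=0), samples.std(axis=0)
for i, (m, s) in enumerate(zip(mean, std)):
    print(f"token {i}: mean attention {m:.3f} +/- {s:.3f}")
```

Plotting a few such sampled rows side by side (or the mean with error bars) would make it easy to see whether the spread concentrates on genuinely ambiguous tokens.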