Bayesian Attention Modules

Xinjie Fan

Neural Information Processing Systems 

Most current models use deterministic attention modules due to their simplicity and ease of optimization. Stochastic counterparts, on the other hand, are less popular despite their potential benefits.
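To make the contrast concrete, the sketch below shows a standard deterministic scaled dot-product attention next to one illustrative stochastic variant that perturbs the scores with Gumbel noise before the softmax, so the attention weights become a random variable. This is a generic illustration of the deterministic/stochastic distinction, not the specific construction proposed in this paper; the function names and the Gumbel-noise choice are assumptions for the example.

```python
import numpy as np


def deterministic_attention(q, k, v):
    # Standard scaled dot-product attention: given q, k, v, the
    # softmax weights are a fixed, deterministic function of the inputs.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def stochastic_attention(q, k, v, rng):
    # One simple stochastic variant (illustrative, not the paper's method):
    # add Gumbel noise to the scores before the softmax, so each forward
    # pass samples a different set of attention weights.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    gumbel = -np.log(-np.log(rng.uniform(size=scores.shape)))
    noisy = scores + gumbel
    weights = np.exp(noisy - noisy.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Calling `deterministic_attention` twice on the same inputs returns identical outputs, while repeated calls to `stochastic_attention` with different random draws produce different attention weights, which is the source of both the optimization difficulty and the potential benefits (e.g., uncertainty estimation) mentioned above.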