Sparse and Continuous Attention Mechanisms André F. T. Martins,, Ant os Treviso Vlad Niculae

Neural Information Processing Systems 

Exponential families are widely used in machine learning; they include many distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet, Poisson, and categorical distributions via the softmax transformation). Distributions in each of these families have fixed support. In contrast, for finite domains, there has been recent work on sparse alternatives to softmax (e.g.