Probabilistic Transformers
We show that Transformers are maximum a posteriori (MAP) estimators for Gaussian mixture models. This gives Transformers a probabilistic interpretation and suggests extensions to inference-time model adaptation and to other probabilistic cases.
Nov-12-2020
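
The link between attention and Gaussian mixtures stated in the abstract can be illustrated numerically. The following is a minimal sketch of one standard identity, not necessarily the derivation used in the paper. It assumes (these are illustrative assumptions, not taken from the abstract) that the keys act as the centers of an equal-weight mixture of isotropic Gaussians with a shared variance `sigma2`, and that all keys have the same norm. Under those assumptions, the posterior probability that a query was generated by each component equals the softmax attention weights with scale `sigma2`. The function names `gmm_responsibilities` and `attention_weights` are hypothetical helpers introduced here.

```python
import numpy as np

def gmm_responsibilities(q, keys, sigma2):
    """Posterior p(j | q) for an equal-weight mixture of isotropic Gaussians.

    Component j is centered at keys[j]; all components share variance sigma2.
    """
    sq_dist = np.sum((keys - q) ** 2, axis=1)
    logits = -sq_dist / (2.0 * sigma2)
    logits = logits - logits.max()  # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def attention_weights(q, keys, scale):
    """Softmax dot-product attention weights of a single query over the keys."""
    logits = keys @ q / scale
    logits = logits - logits.max()  # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(0)
d, n = 8, 5
keys = rng.normal(size=(n, d))
# Assumption: equal-norm keys, so the ||k_j||^2 term is constant across j.
keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
q = rng.normal(size=d)

# Assumption: shared variance chosen to match the usual sqrt(d) attention scale.
sigma2 = np.sqrt(d)

post = gmm_responsibilities(q, keys, sigma2)
attn = attention_weights(q, keys, scale=sigma2)

# The two weight vectors agree up to floating-point error.
assert np.allclose(post, attn)
```

The agreement follows from expanding the squared distance: -||q - k_j||^2 / (2 sigma2) = q·k_j / sigma2 minus terms that do not depend on j when the keys have equal norm, and such terms cancel in the softmax. If the keys do not share a norm or the mixture weights are unequal, the posterior picks up an extra per-key bias term and the two no longer coincide exactly.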