Probability Distributions Computed by Hard-Attention Transformers
