autoregressor
Probability Distributions Computed by Hard-Attention Transformers
Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski, Ryan Cotterell, David Chiang
Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Switzerland (0.04)
- Europe > Germany > Rhineland-Palatinate > Kaiserslautern (0.04)
- Europe > Germany > Rhineland-Palatinate > Landau (0.04)
Neural Machine Translation with Imbalanced Classes
We cast neural machine translation (NMT) as a classification task in an autoregressive setting and analyze the limitations of both classification and autoregression components. Classifiers are known to perform better with balanced class distributions during training. Since the Zipfian nature of languages causes imbalanced classes, we explore the effect of class imbalance on NMT. We analyze the effect of vocabulary sizes on NMT performance and reveal an explanation for 'why' certain vocabulary sizes are better than others.
- North America > United States > California (0.14)
- Oceania > Australia (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- Government > Regional Government > North America Government > United States Government (0.46)
- Government > Military (0.46)