Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Bianchessi, Arthur S., Aguirre, Yasmin C., Barros, Rodrigo C., Kupssinskü, Lucas S.
Effective positional encoding (PE) is vital, particularly for enabling language models trained on shorter contexts to generalize to significantly longer sequences during inference--a desirable capability known as context length extrapolation. Several PE methods have been proposed to facilitate context length extrapolation, including Sinusoidal embeddings (Vaswani, 2017), RoPE (Su et al., 2024), ALiBi (Press et al., 2022), and even the omission of positional encoding altogether. [...] We introduce the Bayesian attention mechanism, hereby called BAM. [...] This dependency is trivially modeled by a scalar Z when the scoring function of the attention mechanism is additive. With Theorem 1, we can frame positional encodings as priors to BAM. Lemma 2: ALiBi is a special case of a BAM prior where the token position distribution comprises [...]. Lemma 3: ALiBi becomes local attention as the relative distance |j - i| increases. (See Appendix B.1, B.2, and B.3.) [...] We propose a new PE prior based on the Generalized Gaussian Distribution; we call this new PE method GGD-BAM.
arXiv.org Artificial Intelligence
Sep-26-2025
- Country:
- South America > Brazil > Rio Grande do Sul > Porto Alegre (0.04)
- Genre:
- Research Report > New Finding (0.67)