Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Bianchessi, Arthur S., Aguirre, Yasmin C., Barros, Rodrigo C., Kupssinskü, Lucas S.
Effective positional encoding (PE) is vital, particularly for enabling language models trained on shorter contexts to generalize to significantly longer sequences during inference--a desirable capability known as context length extrapolation. Several PE methods have been proposed to facilitate context length extrapolation, including Sinusoidal embeddings (Vaswani, 2017), RoPE (Su et al., 2024), ALiBi (Press et al., 2022), and even the omission of positional encoding altogether. [...] We introduce the Bayesian attention mechanism, hereby called BAM. [...] This dependency is trivially modeled by a scalar Z when the scoring function of the attention mechanism is additive. With Theorem 1, we can frame positional encodings as priors to BAM. Lemma 2: ALiBi is a special case of a BAM prior where the token position distribution comprises [...]. Lemma 3: ALiBi becomes local attention as the relative distance |j - i| increases. (See Appendix B.1, B.2, and B.3.) [...] We propose a new PE prior based on the Generalized Gaussian Distribution; we call this new PE method GGD-BAM.
arXiv.org Artificial Intelligence
Sep-26-2025
- Country:
- South America > Brazil > Rio Grande do Sul > Porto Alegre (0.04)
- Genre:
- Research Report > New Finding (0.67)