A Supplementary Materials

Neural Information Processing Systems

A.1 Dataset Description We describe additional details of each dataset in the following. We use the first 90% of each dataset for training and the last 10% for validation, and we hold out the last 210 days as the test set. We further evaluate the sensitivity of the model to different hyper-parameters. The distribution of attention densities for different α is shown in Figure 2.
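The chronological split described above (first 90% for training, last 10% for validation, last 210 days held out for testing) can be sketched as follows; the daily granularity and all names here are assumptions, since the dataset format is not specified in this snippet.

```python
# Sketch of the chronological split described above: the final 210 days form
# the test set, and the remaining prefix is split 90/10 into train/validation.
# The day-indexed layout and function name are illustrative assumptions.

def chronological_split(series, test_days=210, train_frac=0.9):
    """Split a day-indexed sequence into train/validation/test chronologically."""
    dev = series[:-test_days]          # everything before the test window
    test = series[-test_days:]         # held-out final 210 days
    cut = int(len(dev) * train_frac)   # first 90% of the remaining data
    train, val = dev[:cut], dev[cut:]
    return train, val, test

train, val, test = chronological_split(list(range(1000)))
```

Because the split is chronological rather than random, every validation point comes after every training point, and every test point comes after both.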








Hidden Dynamics of Massive Activations in Transformer Training

Gallego-Feliciano, Jorge, McClendon, S. Aaron, Morinelli, Juan, Zervoudakis, Stavros, Saravanos, Antonios

arXiv.org Artificial Intelligence

Massive activations are scalar values in transformer hidden states that achieve values orders of magnitude larger than typical activations and have been shown to be critical for model functionality. While prior work has characterized these phenomena in fully trained models, the temporal dynamics of their emergence during training remain poorly understood. We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows predictable mathematical patterns that can be accurately modeled using an exponentially-modulated logarithmic function with five key parameters. We develop a machine learning framework to predict these mathematical parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. Our findings demonstrate that the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins.
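The abstract describes fitting massive-activation emergence with an exponentially-modulated logarithmic function of five parameters, but does not give the exact parameterization. A hypothetical instance with that general shape (a logarithmic rise, an exponential modulation, a steady-state offset, and an onset shift) might look like the following; the functional form and parameter names are assumptions for illustration only.

```python
import math

# Hypothetical five-parameter "exponentially-modulated logarithmic" curve for
# massive-activation magnitude as a function of training step. The paper's
# exact parameterization is not given in this summary; this form is an
# assumption chosen to illustrate the idea.
def activation_magnitude(step, a, b, c, d, e):
    """a: rise scale, b: exponential decay rate, c: logarithmic time scale,
    d: baseline / steady-state offset, e: emergence onset shift (in steps)."""
    t = max(step - e, 0.0)             # before onset, magnitude sits at d
    return a * math.exp(-b * t) * math.log1p(t / c) + d

# Before the onset shift e, the curve equals the baseline d; after onset it
# rises logarithmically while the exponential term modulates the growth.
```

Given such a form, the paper's prediction task amounts to regressing the five parameters from architectural specifications (depth, width, etc.) before training begins.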


Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost

Cho, Sungjun, Min, Seonwoo, Kim, Jinwoo, Lee, Moontae, Lee, Honglak, Hong, Seunghoon

arXiv.org Artificial Intelligence

To overcome the quadratic cost of self-attention, recent works have proposed various sparse attention modules, most of which fall under one of two groups: 1) sparse attention under hand-crafted patterns and 2) full attention followed by a sparse variant of softmax such as $\alpha$-entmax. Unfortunately, the first group lacks adaptability to data while the second still requires quadratic cost in training. In this work, we propose SBM-Transformer, a model that resolves both problems by endowing each attention head with a mixed-membership Stochastic Block Model (SBM). Each attention head then data-adaptively samples a bipartite graph, the adjacency of which is used as an attention mask for each input. During backpropagation, a straight-through estimator is used to flow gradients beyond the discrete sampling step and adjust the probabilities of sampled edges based on the predictive loss. The forward and backward costs are thus linear in the number of edges, which each attention head can also choose flexibly based on the input. By assessing the distribution of graphs, we theoretically show that SBM-Transformer is a universal approximator for arbitrary sequence-to-sequence functions in expectation. Empirical evaluations under the LRA and GLUE benchmarks demonstrate that our model outperforms previous efficient variants as well as the original Transformer with full attention. Our implementation can be found at https://github.com/sc782/SBM-Transformer .
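The masking scheme sketched in the abstract — soft block memberships on the query and key sides, edge probabilities from a block-interaction matrix, then a sampled bipartite graph used as a 0/1 attention mask — can be illustrated as follows. The shapes, names, and the straight-through shown as a comment are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of data-adaptive sparse attention via a mixed-membership stochastic
# block model (SBM). Each position gets soft memberships over k blocks; edge
# probabilities come from a block-to-block interaction matrix; a graph is
# sampled and its adjacency serves as the attention mask.
rng = np.random.default_rng(0)

n, k = 6, 3                                   # sequence length, SBM blocks
q_memb = rng.dirichlet(np.ones(k), size=n)    # query-side memberships (rows sum to 1)
k_memb = rng.dirichlet(np.ones(k), size=n)    # key-side memberships
B = rng.uniform(size=(k, k))                  # block-to-block edge probabilities

P = q_memb @ B @ k_memb.T                     # n x n edge probabilities in [0, 1]
mask = (rng.uniform(size=P.shape) < P).astype(float)  # sampled 0/1 adjacency

# In an autograd framework, the straight-through estimator would be written
# roughly as: st_mask = mask + P - stop_gradient(P), so the forward pass uses
# the discrete mask while gradients flow through the continuous P.
```

Since attention scores need only be computed where the mask is 1, the cost scales with the number of sampled edges rather than with n².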


Kernel Deformed Exponential Families for Sparse Continuous Attention

Moreno, Alexander, Nagesh, Supriya, Wu, Zhenke, Dempsey, Walter, Rehg, James M.

arXiv.org Machine Learning

Attention mechanisms take an expectation of a data representation with respect to probability weights. This creates summary statistics that focus on important features. Recently, Martins et al. (2020; 2021) proposed continuous attention mechanisms, focusing on unimodal attention densities from the exponential and deformed exponential families: the latter has sparse support. Farinhas et al. (2021) extended this to use Gaussian mixture attention densities, which are a flexible class with dense support. In this paper, we extend this to two general flexible classes: kernel exponential families (Canu & Smola, 2006) and our new sparse counterpart, kernel deformed exponential families. Theoretically, we show new existence results for both kernel exponential and deformed exponential families, and that the deformed case has similar approximation capabilities to kernel exponential families. Experiments show that kernel deformed exponential families can attend to multiple compact regions of the data domain. Attention mechanisms take weighted averages of data representations (Bahdanau et al., 2015), where the weights are a function of input objects. These are then used as inputs for prediction. Discrete attention 1) cannot easily handle data where observations are irregularly spaced, and 2) can produce scattered attention maps that lack focus.
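The opening definition — attention as an expectation of value representations under probability weights — can be sketched in its standard discrete (softmax) form; the continuous kernel-family densities the paper introduces are not reproduced here, and all shapes and names below are illustrative assumptions.

```python
import numpy as np

# Discrete attention as an expectation: a probability distribution over
# positions (softmax of query-key scores) weights the value representations
# to produce a summary statistic.
rng = np.random.default_rng(1)

d, n = 4, 5                        # feature dimension, number of positions
q = rng.normal(size=d)             # query
K = rng.normal(size=(n, d))        # keys
V = rng.normal(size=(n, d))        # values

scores = K @ q / np.sqrt(d)        # scaled dot-product scores
w = np.exp(scores - scores.max())  # numerically stable softmax
w /= w.sum()                       # w is a probability distribution over positions

summary = w @ V                    # expectation of values under w
```

Note that softmax weights are strictly positive everywhere (dense support); sparse variants such as α-entmax, and the deformed exponential families discussed above, instead zero out low-scoring regions, giving the attention density compact support.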