leakyrelu
More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations
Wang, Mingze, Wang, Jinbo, Xia, Yikuan, Shen, Kai, Zhong, Shu
Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.
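To make the distinction between LA and MoA concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a ReLU-type (non-gated) FFN, a three-entry activation dictionary (ReLU, leaky ReLU, SiLU), and a softmax gate computed from a single linear map of the token. The abstract specifies none of these choices, so the names `moa_ffn`, `la_ffn`, `w_gate`, and `alpha` are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def silu(z):
    return z / (1.0 + np.exp(-z))

# Assumed activation dictionary; the paper's actual dictionary may differ.
ACTIVATIONS = (relu, leaky_relu, silu)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def apply_dictionary(h):
    # Apply every activation to the same pre-activations: (tokens, d_ff, K).
    return np.stack([phi(h) for phi in ACTIVATIONS], axis=-1)

def la_ffn(x, w_up, w_down, alpha):
    # Learnable activations: one input-independent coefficient vector alpha (K,).
    acts = apply_dictionary(x @ w_up)
    return (acts @ alpha) @ w_down

def moa_ffn(x, w_up, w_down, w_gate):
    # Mixture of activations: per-token gates (assumed softmax of a linear map),
    # with the up and down projections shared across all activations.
    acts = apply_dictionary(x @ w_up)             # (tokens, d_ff, K)
    gates = softmax(x @ w_gate)                   # (tokens, K)
    mixed = np.einsum("tfk,tk->tf", acts, gates)  # token-wise mix over K
    return mixed @ w_down
```

In this sketch, a one-hot `alpha` reduces `la_ffn` to a fixed-activation FFN, and an all-zero `w_gate` makes the gates uniform, so `moa_ffn` reduces to `la_ffn` with equal coefficients. This mirrors the containment ordering stated in the abstract, though only as an illustration under the assumptions above.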
Reliable Estimation of KL Divergence using a Discriminator in Reproducing Kernel Hilbert Space: Supplementary Material
Organization: This supplementary material is presented in a format parallel to the main paper, with section numbers and titles consistent with it. We add one new section, Section 10, which describes the societal impacts and possible negative impacts of the paper. Theorem numbers are likewise consistent with the main paper, but we include several additional theorems and lemmas that were not in the main paper. GAN-type Objective for KL Estimation: Let f be a discriminator, f: X → ℝ. Let p(x) and q(x) be two probability density functions defined over the space X.
Bayesian Attention Modules: Appendix A Algorithm
Then softmax is applied to obtain probabilities. To tune the hyperparameters in BAM, we randomly hold out 20% of the training set for validation. The vocabulary size V is 9488 and the max caption length T is 16. During training, we use MLE loss only, without scheduled sampling or RL loss. At step j of decoding, the current LSTM state x (a function of the previous target words y_{1:j-1}) is used as the query.