 relu


More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

arXiv.org Machine Learning

Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.
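To make the contrast between fixed activations, LA, and MoA concrete, here is a minimal pure-Python sketch for a ReLU-type FFN. It rests on assumptions that the abstract does not state: a three-entry activation dictionary (ReLU, tanh-approximated GELU, SiLU), a softmax gate computed from a linear map of the token, and the hypothetical names `moa_ffn`, `la_ffn`, `W_in`, `W_out`, and `W_gate`. It is an illustration of the idea, not the paper's implementation.

```python
import math

def relu(z):
    return max(0.0, z)

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

def silu(z):
    return z / (1.0 + math.exp(-z))

# Assumed activation dictionary; the paper's actual dictionary may differ.
ACTIVATIONS = (relu, gelu, silu)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def la_ffn(x, W_in, W_out, coeffs):
    """Learnable activation: one input-independent linear combination."""
    h = matvec(W_in, x)
    mixed = [sum(c * act(z) for c, act in zip(coeffs, ACTIVATIONS)) for z in h]
    return matvec(W_out, mixed)

def moa_ffn(x, W_in, W_out, W_gate):
    """Mixture of activations: mixing weights depend on the token x.

    The linear projections W_in and W_out are shared across all activations;
    only the small gate W_gate (one row per activation) is added.
    """
    h = matvec(W_in, x)
    gates = softmax(matvec(W_gate, x))  # assumed gate form
    mixed = [sum(g * act(z) for g, act in zip(gates, ACTIVATIONS)) for z in h]
    return matvec(W_out, mixed), gates
```

In this sketch, setting `coeffs` to `[1.0, 0.0, 0.0]` recovers a fixed ReLU FFN, and a zero `W_gate` makes MoA collapse to LA with uniform coefficients, which mirrors the containment ordering described in the abstract.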





Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks

Neural Information Processing Systems

One of the central questions in the theory of deep learning is to understand how neural networks learn hierarchical features. The ability of deep networks to extract salient features is crucial to both their outstanding generalization ability and the modern deep learning paradigm of pretraining and finetuning. However, this feature learning process remains poorly understood from a theoretical perspective, with existing analyses largely restricted to two-layer networks. In this work we show that three-layer neural networks have provably richer feature learning capabilities than two-layer networks. We analyze the features learned by a three-layer network trained with layer-wise gradient descent, and present a general-purpose theorem which upper bounds the sample complexity and width needed to achieve low test error when the target has specific hierarchical structure. We instantiate our framework in specific statistical learning settings (single-index models and functions of quadratic features) and show that in the latter setting three-layer networks obtain a sample complexity improvement over all existing guarantees for two-layer networks. Crucially, this sample complexity improvement relies on the ability of three-layer networks to efficiently learn nonlinear features. We then establish a concrete optimization-based depth separation by constructing a function which is efficiently learnable via gradient descent on a three-layer network, yet cannot be learned efficiently by a two-layer network. Our work makes progress towards understanding the provable benefit of three-layer neural networks over two-layer networks in the feature learning regime.


The proposition makes use of the following observation: For the discriminator defined in (1), the norm of the gradient for $w_t$ is upper bounded by $\|\nabla_{w_t} D_\theta(x)\|_F \le \|x\| \cdots$

Neural Information Processing Systems

The upper bound on the gradient's Frobenius norm for spectrally normalized discriminators follows directly. As $l_w(x)$ is a linear transformation, we have $l_{cw}(x) = c\,l_w(x)$ and $l_w(cx) = c\,l_w(x)$. Moreover, since ReLU and leaky ReLU are linear on each of $\mathbb{R}^+$ and $\mathbb{R}^-$, we have $a_i(cx) = c\,a_i(x)$ for $c \ge 0$. In this section we discuss the gradients with respect to the actual parameter $w_i$. From Eq. (12) in [30] we know $\nabla_{w_t} D_\theta(x) = A$, and we know that $\|\nabla_{w'_t} D_\theta(x)\|_F$, $\|\nabla_{o^t_l(x)} D_\theta(x)\|$, and $\|o^t_l(x)\|$ have upper bounds. From Theorem 1.1 in [44] we know that if $w_t$ is initialized with i.i.d. random variables from a uniform or Gaussian distribution, $\mathbb{E}\,\|w_t\|_{sp}$ is lower bounded away from zero at initialization. So $\|\nabla_{w_t} D_\theta(x)\|_F$ is upper bounded at initialization. Moreover, we observe empirically that $\|w_t\|_{sp}$ is usually increasing during training. Therefore, $\|\nabla_{w_t} D_\theta(x)\|_F$ is typically upper bounded during training as well. The following proposition states that spectral normalization also gives an upper bound on $\|H_{w_i}(D_\theta)(x)\|_{sp}$ for networks with ReLU or leaky ReLU internal activations.
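The scaling identities used in this argument can be checked numerically. The sketch below is illustrative only: the helper names, the example matrix, and the leaky ReLU slope of 0.25 are assumptions chosen for the demonstration, not values from the paper. It verifies that a linear layer is homogeneous in both its weights and its input, and that ReLU and leaky ReLU commute with a nonnegative scale factor.

```python
def relu(z):
    return max(0.0, z)

def leaky_relu(z, slope=0.25):
    # slope is an arbitrary illustrative value
    return z if z >= 0 else slope * z

def linear(w, x):
    """l_w(x): matrix-vector product with no bias term."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def scale_vec(v, c):
    return [c * vi for vi in v]

def scale_mat(w, c):
    return [[c * wij for wij in row] for row in w]

def close(u, v, tol=1e-12):
    return all(abs(a - b) <= tol for a, b in zip(u, v))

def is_homogeneous(act, zs, c):
    """Check a(c z) == c a(z) on the sample points zs."""
    return all(abs(act(c * z) - c * act(z)) <= 1e-12 for z in zs)

w = [[1.0, -2.0], [0.5, 3.0]]
x = [2.0, -1.0]
c = 3.0
zs = [-2.0, -0.5, 0.0, 0.5, 2.0]

weights_ok = close(linear(scale_mat(w, c), x), scale_vec(linear(w, x), c))  # l_{cw}(x) = c l_w(x)
inputs_ok = close(linear(w, scale_vec(x, c)), scale_vec(linear(w, x), c))   # l_w(cx) = c l_w(x)
relu_ok = is_homogeneous(relu, zs, c)
leaky_ok = is_homogeneous(leaky_relu, zs, c)
```

The identity for the activations holds only for nonnegative scale factors: with `c = -1.0`, `relu(c * 2.0)` is `0.0` while `c * relu(2.0)` is `-2.0`.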


47a658229eb2368a99f1d032c8848542-Supplemental.pdf

Neural Information Processing Systems

Based on the feedback from the reviewers, we perform the following additional experiments, which explore the robustness of the choice of buffer size in SGD-RER, the choice of step sizes for GLMtron, and the behavior of these algorithms with heavy-tailed noise, with a setup similar to that in Section 7. We first perform an experimental study of the robustness of SGD-RER to the choice of buffer size in Figure 3a. Notice that the performance remains the same for a large range of buffer sizes (from 100 to 2000). However, the performance degrades when the buffer size is too large (10000). We believe this is the case since the number of buffers decreases as the buffer size increases and the output is averaged over too few iterates (in the case of B = 10000, the final output is just an average of 10 iterates). Theoretically, this largest step size is $L^{-1}$, where $L$ is the largest eigenvalue of the Hessian. In the case of GLMtron, it was experimentally observed that if the step size was chosen to be about 1.5 times the step size reported in Section 7, the iterates diverged. The Quasi-Newton method essentially normalizes the gradient with the inverse of the Hessian (or rather an approximation of the Hessian) in order to let it converge faster with large step sizes. In Figure 4, we consider the same system as in Section 7 but with heavy-tailed noise given by the Student's t distribution (scale ν = 4.1) so that the 4th moment exists but higher moments do not. The typical behavior of Forward SGD, SGD-ER, SGD-RER and Quasi-Newton methods seems to be similar to that observed in the sub-Gaussian noise case. However, GLMtron requires much smaller step sizes to ensure convergence and hence takes much longer.
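The heavy-tailed noise in this experiment can be sketched with the standard library alone. The snippet assumes that ν = 4.1 plays the role of the degrees-of-freedom parameter of the Student's t distribution, which is the reading consistent with the 4th moment existing while higher integer moments do not. The function names are hypothetical, and this is not the authors' experimental code.

```python
import math
import random

def student_t_sample(nu, rng):
    """Draw one Student's t variate with nu degrees of freedom.

    Uses the representation T = Z / sqrt(V / nu), where Z is standard normal
    and V is chi-square with nu degrees of freedom (Gamma with shape nu/2, scale 2).
    """
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(nu / 2.0, 2.0)
    return z / math.sqrt(v / nu)

def moment_is_finite(order, nu):
    """The absolute moment of a given order is finite exactly when order < nu."""
    return order < nu

rng = random.Random(0)
noise = [student_t_sample(4.1, rng) for _ in range(1000)]
```

With ν = 4.1, `moment_is_finite(4, 4.1)` is true and `moment_is_finite(5, 4.1)` is false, so the noise has a finite 4th moment but no finite 5th moment.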



Circa: Stochastic ReLUs for Private Deep Learning

Neural Information Processing Systems

The simultaneous rise of machine learning as a service and concerns over user privacy have increasingly motivated the need for private inference (PI). While recent work demonstrates PI is possible using cryptographic primitives, the computational overheads render it impractical. State-of-the-art deep networks are inadequate in this context because the source of slowdown in PI stems from the ReLU operations, whereas optimizations for plaintext inference focus on reducing FLOPs. In this paper we re-think ReLU computations and propose optimizations for PI tailored to properties of neural networks. Specifically, we reformulate ReLU as an approximate sign test and introduce a novel truncation method for the sign test that significantly reduces the cost per ReLU. These optimizations result in a specific type of stochastic ReLU. The key observation is that the stochastic fault behavior is well suited for the fault-tolerant properties of neural network inference. Thus, we provide significant savings without impacting accuracy. We collectively call the optimizations Circa and demonstrate improvements of up to 4.7× storage and 3× runtime over baseline implementations; we further show that Circa can be used on top of recent PI optimizations to obtain 1.8× additional speedup.
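The idea of a truncated sign test that yields a stochastic ReLU can be illustrated with a small plaintext simulation. Everything below is an assumption made for illustration: the 32-bit ring, additive two-party secret sharing, the truncation rule, and all function names. In a real PI protocol the comparison would run under a cryptographic primitive rather than in the clear, and Circa's actual construction may differ.

```python
import random

N_BITS = 32            # assumed ring size: integers modulo 2**32
MOD = 1 << N_BITS

def share(x, rng):
    """Split integer x into two additive shares modulo 2**N_BITS."""
    r = rng.randrange(MOD)
    return r, (x % MOD - r) % MOD

def truncated_sign_is_nonneg(s1, s2, k):
    """Sign test on shares after dropping the k low bits of each share.

    Truncating the shares separately loses the carry out of the low bits,
    so the reconstructed value can be off by one unit of 2**k.
    """
    width = N_BITS - k
    t = ((s1 >> k) + (s2 >> k)) % (1 << width)
    return t < (1 << (width - 1))   # top bit clear means non-negative

def stochastic_relu(x, k, rng):
    """Return x if the truncated sign test says x >= 0, else 0."""
    s1, s2 = share(x, rng)
    return x if truncated_sign_is_nonneg(s1, s2, k) else 0

rng = random.Random(0)
example = [stochastic_relu(x, 8, rng) for x in (-1000, -1, 1, 300, 100000)]
```

In this simplified model, with k = 0 the test is exact. For k > 0, inputs of magnitude at least 2**k that are well inside the representable range are always handled correctly, while small positive inputs in [0, 2**k) may be set to zero depending on the random shares. Faults are therefore confined to values near zero, where they perturb the activation by less than 2**k, which is the kind of error a fault-tolerant network can absorb.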