gated linear unit
Flat Channels to Infinity in Neural Loss Landscapes
Martinelli, Flavio, Van Meegen, Alexander, Şimşek, Berfin, Gerstner, Wulfram, Brea, Johanni
The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm\infty$, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w_i} \cdot \mathbf{x}) + a_j\sigma(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or Adam, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
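The limit stated in the abstract can be checked numerically with a small sketch. The parametrization below is an assumption made for illustration, not the authors' construction: the input weights are set to $\mathbf{w} \pm \epsilon\mathbf{v}$, the output weights to $1/2 \pm 1/(2\epsilon)$ (so they diverge with opposite signs as $\epsilon \to 0$), and $\sigma$ is taken to be tanh. All function names are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def two_neurons(x, w, v, eps):
    """a_i*tanh(w_i.x) + a_j*tanh(w_j.x) under an ASSUMED parametrization:
    w_i = w + eps*v, w_j = w - eps*v, a_i = 1/2 + 1/(2*eps), a_j = 1/2 - 1/(2*eps)."""
    w_i = [wk + eps * vk for wk, vk in zip(w, v)]
    w_j = [wk - eps * vk for wk, vk in zip(w, v)]
    a_i = 0.5 + 1.0 / (2.0 * eps)
    a_j = 0.5 - 1.0 / (2.0 * eps)
    return a_i * math.tanh(dot(w_i, x)) + a_j * math.tanh(dot(w_j, x))

def gated_linear_limit(x, w, v):
    """sigma(w.x) + (v.x) * sigma'(w.x) with sigma = tanh, sigma' = 1 - tanh^2."""
    z = dot(w, x)
    return math.tanh(z) + dot(v, x) * (1.0 - math.tanh(z) ** 2)

# Arbitrary illustrative values (not taken from the paper).
x = [0.3, -1.2]
w = [0.7, 0.5]
v = [1.0, -2.0]

# Gap between the two-neuron sum and the gated linear unit as eps shrinks.
errors = [abs(two_neurons(x, w, v, eps) - gated_linear_limit(x, w, v))
          for eps in (1e-1, 1e-2, 1e-3)]
print(errors)
```

Under this parametrization the sum splits into an average of the two activations plus a central difference quotient, so the gap to the gated linear unit shrinks as the output weights grow and the input weights merge.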
RapidBERT_NeurIPS_Submission-2023-5-24-358pm
The GLUE benchmark consists of 8 (originally 9) tasks [Wang et al., 2018]. [...] Hypothesis: "It has a buffet." [...] CoLA (Corpus of Linguistic Acceptability) [8,551 train, 1,063 test] [Warstadt et al., 2019] is a [...] "The higher the stakes, the lower his expectations are." [...] The task is to classify the sentiment as either positive or negative [Socher et al., 2013]. [...] Note that we excluded finetuning on the 9th GLUE task WNLI (Winograd NLI) [Levesque et al., [...] We used the hyperparameters in Table S1 for finetuning all BERT and RapidBERT models.
Expanded Gating Ranges Improve Activation Functions
Activation functions are core components of all deep learning architectures. Currently, the most popular activation functions are smooth ReLU variants like GELU and SiLU. These are self-gated activation functions where the range of the gating function is between zero and one. In this paper, we explore the viability of using arctan as a gating mechanism. A self-gated activation function that uses arctan as its gating function has a monotonically increasing first derivative. To make this activation function competitive, it is necessary to introduce a trainable parameter for every MLP block to expand the range of the gating function beyond zero and one. We find that this technique also improves existing self-gated activation functions. We conduct an empirical evaluation of Expanded ArcTan Linear Unit (xATLU), Expanded GELU (xGELU), and Expanded SiLU (xSiLU) and show that they outperform existing activation functions within a transformer architecture. Additionally, expanded gating ranges show promising results in improving first-order Gated Linear Units (GLU).
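The idea of expanding a gating range can be illustrated with a short sketch. The abstract does not give the formulas, so the affine map below (sending a gate in $(0, 1)$ to $(-\alpha, 1 + \alpha)$) and the rescaling of arctan to $(0, 1)$ are assumptions chosen for illustration; `alpha` stands in for the trainable per-block parameter and is a plain float here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    """Standard SiLU: x gated by a sigmoid with range (0, 1)."""
    return x * sigmoid(x)

def expanded_gate(gate, alpha):
    """ASSUMED expansion: affine map from (0, 1) to (-alpha, 1 + alpha)."""
    return gate * (1.0 + 2.0 * alpha) - alpha

def xsilu(x, alpha):
    """Sketch of an expanded SiLU; reduces to SiLU when alpha == 0."""
    return x * expanded_gate(sigmoid(x), alpha)

def xatlu(x, alpha):
    """Sketch of an expanded arctan-gated unit.
    ASSUMPTION: arctan is rescaled to (0, 1) before the expansion."""
    gate = math.atan(x) / math.pi + 0.5
    return x * expanded_gate(gate, alpha)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, silu(x), xsilu(x, 0.1), xatlu(x, 0.1))
```

With `alpha > 0` the gate can go slightly negative for strongly negative inputs and slightly above one for strongly positive inputs, which is what "beyond zero and one" means in this sketch.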
Improving Knee Joint Angle Prediction through Dynamic Contextual Focus and Gated Linear Units
Saoud, Lyes Saad, Ibrahim, Humaid, Aljarah, Ahmad, Hussain, Irfan
Accurate knee joint angle prediction is crucial for biomechanical analysis and rehabilitation. In this study, we introduce FocalGatedNet, a novel deep learning model that incorporates Dynamic Contextual Focus (DCF) Attention and Gated Linear Units (GLU) to enhance feature dependencies and interactions. Our model is evaluated on a large-scale dataset and compared to established models in multi-step gait trajectory prediction. Our results reveal that FocalGatedNet outperforms existing models for long-term prediction lengths (20 ms, 60 ms, 80 ms, and 100 ms), demonstrating significant improvements in Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Specifically for the case of 80 ms, FocalGatedNet achieves a notable MAE reduction of up to 24\%, RMSE reduction of up to 14\%, and MAPE reduction of up to 36\% when compared to Transformer, highlighting its effectiveness in capturing complex knee joint angle patterns. Moreover, FocalGatedNet maintains a lower computational load than most equivalent deep learning models, making it an efficient choice for real-time biomechanical analysis and rehabilitation applications.
GLU Variants Improve Transformer
The Transformer [Vaswani et al., 2017] sequence-to-sequence model alternates between multi-head attention and what it calls "position-wise feed-forward networks" (FFN). The FFN takes a vector $x$ (the hidden representation at a particular position in the sequence) and passes it through two learned linear transformations (represented by the matrices $W_1$ and $W_2$ and bias vectors $b_1$ and $b_2$). A rectified-linear (ReLU) [Glorot et al., 2011] activation function is applied between the two linear transformations: $\mathrm{FFN}(x, W_1, W_2, b_1, b_2) = \max(0, xW_1 + b_1)W_2 + b_2$ (1). Following the T5 codebase [Raffel et al., 2019], we use a version with no bias: $\mathrm{FFN}_{\mathrm{ReLU}}(x, W_1, W_2) = \max(xW_1, 0)W_2$ (2). Subsequent work has proposed replacing the ReLU with other nonlinear activation functions such as Gaussian Error Linear Units, $\mathrm{GELU}(x) = x\Phi(x)$ [Hendrycks and Gimpel, 2016], and $\mathrm{Swish}_\beta(x) = x\sigma(\beta x)$ [Ramachandran et al., 2017].
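The bias-free feed-forward layer and the alternative activations described above can be sketched in a few lines. This is a minimal dependency-free illustration, not the T5 implementation: vectors and matrices are plain Python lists, the helper names (`vecmat`, `ffn`) are hypothetical, and the weights are toy values chosen so the result is easy to verify by hand.

```python
import math

def relu(z):
    return max(z, 0.0)

def gelu(z):
    """GELU(z) = z * Phi(z), with Phi the standard normal CDF."""
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def swish(z, beta=1.0):
    """Swish_beta(z) = z * sigmoid(beta * z)."""
    return z / (1.0 + math.exp(-beta * z))

def vecmat(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def ffn(x, W1, W2, activation=relu):
    """Bias-free FFN: activation(x W1) W2, applied elementwise on the hidden layer."""
    hidden = [activation(h) for h in vecmat(x, W1)]
    return vecmat(hidden, W2)

# Toy example: d_model = 2, d_ff = 3.
x = [1.0, 2.0]
W1 = [[1.0, 0.0, -1.0],
      [0.5, 1.0, -1.0]]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [2.0, -1.0]]

print(ffn(x, W1, W2))                   # ReLU variant
print(ffn(x, W1, W2, activation=gelu))  # GELU variant
print(ffn(x, W1, W2, activation=swish)) # Swish variant with beta = 1
```

Here $xW_1 = [2, 2, -3]$, so ReLU zeroes the third hidden unit and the ReLU output is $[2, 2]$; passing the activation as an argument makes the substitutions mentioned in the text a one-argument change.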