AITopics | gated linear unit

gated linear unit

Masked Gated Linear Unit

Neural Information Processing Systems | Jun-14-2026, 05:56:51 GMT

Gated Linear Units (GLUs) have become essential components in the feed-forward networks of state-of-the-art Large Language Models (LLMs). However, they require twice as many memory reads as feed-forward layers without gating, because they use separate weight matrices for the gate and value streams. To address this bottleneck, we introduce Masked Gated Linear Units (MGLUs), a novel family of GLUs with an efficient kernel implementation. The core contributions of MGLUs include: (1) the Mixture of Element-wise Gating (MoEG) architecture, which learns multiple binary masks, each determining gate or value assignments at the element level on a single shared weight matrix, resulting in reduced memory transfer, and (2) FlashMGLU, a hardware-friendly kernel that yields up to a 19.7$\times$ inference-time speed-up over a naive PyTorch MGLU and is 47% more memory-efficient and 34% faster than standard GLUs, despite added architectural complexity, on an RTX5090 GPU. In LLM experiments, the Swish-activated variant SwiMGLU preserves its memory advantages while matching, or even surpassing, the downstream accuracy of the SwiGLU baseline.
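
The abstract's idea of element-level gate or value assignment on one shared weight matrix can be illustrated with a small pure-Python sketch. This is not the paper's implementation: the function names (`swiglu`, `masked_glu`), the convention that a mask entry of 1 routes a weight to the gate stream and 0 to the value stream, and the summation over masks are all assumptions made for illustration. The sketch also materializes the masked matrices explicitly, which a fused kernel such as FlashMGLU would presumably avoid.

```python
import math
import random


def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu(x, w_gate, w_value):
    """Standard SwiGLU hidden activation: two separate weight matrices."""
    gate = matvec(x, w_gate)
    value = matvec(x, w_value)
    return [swish(g) * v for g, v in zip(gate, value)]


def masked_glu(x, w, masks):
    """Hypothetical MGLU-style layer: one shared matrix w, several binary masks.

    For each mask m, entries with m == 1 feed the gate stream and entries
    with m == 0 feed the value stream; the per-mask outputs are summed.
    (Assumed convention, not taken from the paper.)
    """
    d_in, d_out = len(w), len(w[0])
    out = [0.0] * d_out
    for m in masks:
        w_gate = [[w[i][j] * m[i][j] for j in range(d_out)] for i in range(d_in)]
        w_value = [[w[i][j] * (1 - m[i][j]) for j in range(d_out)] for i in range(d_in)]
        out = [o + h for o, h in zip(out, swiglu(x, w_gate, w_value))]
    return out


random.seed(0)
d_in, d_out, n_masks = 4, 3, 2
x = [random.gauss(0, 1) for _ in range(d_in)]
w = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
masks = [[[random.randint(0, 1) for _ in range(d_out)] for _ in range(d_in)]
         for _ in range(n_masks)]

hidden = masked_glu(x, w, masks)

# Floating-point weights stored: one matrix for the masked layer,
# two matrices (gate and value) for plain SwiGLU.
shared_params = d_in * d_out
swiglu_params = 2 * d_in * d_out
```

Under these assumptions the masked layer stores half as many floating-point weights as SwiGLU, plus the binary masks, which is the source of the reduced memory transfer the abstract describes.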

large language model, machine learning, natural language, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.60)

Flat Channels to Infinity in Neural Loss Landscapes

Neural Information Processing Systems | Jun-12-2026, 14:20:48 GMT

artificial intelligence, machine learning, mathbf, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.76)

Flow Matching for Scalable Simulation-Based Inference

Neural Information Processing Systems | Nov-15-2025, 05:51:21 GMT

Figure 1: Comparison of network architectures (left) and flow trajectories (right). Discrete flows (NPE, top) require a specialized architecture for the density estimator. Continuous flows (FMPE, bottom) are based on a vector field parametrized with an unconstrained architecture.

artificial intelligence, bayesian inference, machine learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
Information Technology > Data Science (0.93)
(3 more...)

Flat Channels to Infinity in Neural Loss Landscapes

Martinelli, Flavio, Van Meegen, Alexander, Şimşek, Berfin, Gerstner, Wulfram, Brea, Johanni

arXiv.org Artificial Intelligence | Nov-13-2025

The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm\infty$, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w_i} \cdot \mathbf{x}) + a_j\sigma(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or Adam, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
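
The limit stated in the abstract can be checked numerically with a short sketch. The parametrization below is an assumption chosen for illustration, not taken from the paper: the input weights are $\mathbf{w} \pm \epsilon\mathbf{v}$, the output weights are $\tfrac{1}{2} \pm \tfrac{1}{2\epsilon}$, and $\sigma$ is taken to be tanh. As $\epsilon \to 0$ the output weights diverge with opposite signs and the input weights merge, matching the behaviour described above.

```python
import math


def sigma(z):
    """Activation function; tanh is an illustrative choice."""
    return math.tanh(z)


def sigma_prime(z):
    """Derivative of tanh."""
    return 1.0 - math.tanh(z) ** 2


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def two_neuron_output(x, w, v, eps):
    """a_i * sigma(w_i . x) + a_j * sigma(w_j . x) along the assumed channel."""
    w_i = [wk + eps * vk for wk, vk in zip(w, v)]
    w_j = [wk - eps * vk for wk, vk in zip(w, v)]
    a_i = 0.5 + 1.0 / (2.0 * eps)   # diverges to +infinity as eps -> 0
    a_j = 0.5 - 1.0 / (2.0 * eps)   # diverges to -infinity as eps -> 0
    return a_i * sigma(dot(w_i, x)) + a_j * sigma(dot(w_j, x))


def gated_limit(x, w, v):
    """sigma(w . x) + (v . x) * sigma'(w . x): the gated linear unit."""
    return sigma(dot(w, x)) + dot(v, x) * sigma_prime(dot(w, x))


x = [0.3, -1.2, 0.7]
w = [0.5, 0.1, -0.4]
v = [1.0, -0.5, 0.25]

# Gap between the two-neuron output and the gated limit for shrinking eps.
errors = [abs(two_neuron_output(x, w, v, eps) - gated_limit(x, w, v))
          for eps in (1e-1, 1e-2, 1e-3)]
```

With this parametrization the symmetric part of the sum tends to $\sigma(\mathbf{w} \cdot \mathbf{x})$ and the antisymmetric part is a central finite difference that tends to $(\mathbf{v} \cdot \mathbf{x})\sigma'(\mathbf{w} \cdot \mathbf{x})$, so the gap shrinks as $\epsilon$ decreases.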

artificial intelligence, machine learning, saddle line, (17 more...)

arXiv.org Artificial Intelligence

2506.14951

Country:

Asia (0.28)
Europe (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Flow Matching for Scalable Simulation-Based Inference

Neural Information Processing Systems | Oct-8-2025, 10:47:11 GMT

architecture, arxiv, inference, (14 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
Information Technology > Data Science (0.93)
(3 more...)

RapidBERT_NeurIPS_Submission-2023-5-24-358pm

Jacob Portes

Neural Information Processing Systems | Sep-24-2025, 12:39:19 GMT

batch size, implementation, throughput, (12 more...)

Neural Information Processing Systems

Industry:

Materials > Chemicals > Industrial Gases > Liquified Gas (0.50)
Materials > Chemicals > Commodity Chemicals > Petrochemicals > LNG (0.50)
Energy > Oil & Gas > Midstream (0.50)

Technology: Information Technology > Artificial Intelligence (0.95)

RapidBERT_NeurIPS_Submission-2023-5-24-358pm

Jacob Portes

Neural Information Processing Systems | Aug-13-2025, 21:56:24 GMT

The GLUE benchmark consists of 8 (originally 9) tasks [Wang et al., 2018]. Hypothesis: "It has a buffet." CoLA (Corpus of Linguistic Acceptability) [8,551 train, 1,063 test] [Warstadt et al., 2019] is a "The higher the stakes, the lower his expectations are." The task is to classify the sentiment as either positive or negative [Socher et al., 2013]. Note that we excluded finetuning on the 9th GLUE task WNLI (Winograd NLI) [Levesque et al., We used the hyperparameters in Table S1 for finetuning all BERT and RapidBERT models.

artificial intelligence, implementation, throughput, (13 more...)

Neural Information Processing Systems

Industry:

Materials > Chemicals > Industrial Gases > Liquified Gas (0.50)
Materials > Chemicals > Commodity Chemicals > Petrochemicals > LNG (0.50)
Energy > Oil & Gas > Midstream (0.50)

Technology: Information Technology > Artificial Intelligence (0.95)

Expanded Gating Ranges Improve Activation Functions

Huang, Allen Hao

arXiv.org Artificial Intelligence | May-25-2024

Activation functions are core components of all deep learning architectures. Currently, the most popular activation functions are smooth ReLU variants like GELU and SiLU. These are self-gated activation functions where the range of the gating function is between zero and one. In this paper, we explore the viability of using arctan as a gating mechanism. A self-gated activation function that uses arctan as its gating function has a monotonically increasing first derivative. To make this activation function competitive, it is necessary to introduce a trainable parameter for every MLP block to expand the range of the gating function beyond zero and one. We find that this technique also improves existing self-gated activation functions. We conduct an empirical evaluation of Expanded ArcTan Linear Unit (xATLU), Expanded GELU (xGELU), and Expanded SiLU (xSiLU) and show that they outperform existing activation functions within a transformer architecture. Additionally, expanded gating ranges show promising results in improving first-order Gated Linear Units (GLU).
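
The idea of expanding a gating range beyond zero and one can be sketched as follows. The specific formulas are assumptions for illustration rather than the paper's definitions: a gate value $g \in (0, 1)$ is mapped to $(-\alpha, 1 + \alpha)$ via $g(1 + 2\alpha) - \alpha$, arctan is rescaled to $(0, 1)$ before expansion, and `alpha` stands in for the trainable per-block parameter (here just a plain float).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def arctan_gate(z):
    """Arctan rescaled from (-pi/2, pi/2) to the range (0, 1)."""
    return (math.atan(z) + math.pi / 2.0) / math.pi


def expand(gate_value, alpha):
    """Map a gate value from (0, 1) to (-alpha, 1 + alpha) (assumed form)."""
    return gate_value * (1.0 + 2.0 * alpha) - alpha


def silu(z):
    """Standard SiLU: self-gated with a sigmoid gate in (0, 1)."""
    return z * sigmoid(z)


def xsilu(z, alpha):
    """Hypothetical expanded SiLU: sigmoid gate with an expanded range."""
    return z * expand(sigmoid(z), alpha)


def xatlu(z, alpha):
    """Hypothetical expanded arctan linear unit."""
    return z * expand(arctan_gate(z), alpha)


# With alpha = 0 the expanded variant reduces to the ordinary activation.
baseline_gap = abs(xsilu(1.3, 0.0) - silu(1.3))
```

In a real network `alpha` would be a learnable scalar per MLP block, as the abstract describes; setting it to zero recovers the unexpanded gate.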

activation function, gelu and silu, xatlu, (13 more...)

arXiv.org Artificial Intelligence

2405.20768

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Improving Knee Joint Angle Prediction through Dynamic Contextual Focus and Gated Linear Units

Saoud, Lyes Saad, Ibrahim, Humaid, Aljarah, Ahmad, Hussain, Irfan

arXiv.org Artificial Intelligence | Oct-2-2023

Accurate knee joint angle prediction is crucial for biomechanical analysis and rehabilitation. In this study, we introduce FocalGatedNet, a novel deep learning model that incorporates Dynamic Contextual Focus (DCF) Attention and Gated Linear Units (GLU) to enhance feature dependencies and interactions. Our model is evaluated on a large-scale dataset and compared to established models in multi-step gait trajectory prediction. Our results reveal that FocalGatedNet outperforms existing models for long-term prediction lengths (20 ms, 60 ms, 80 ms, and 100 ms), demonstrating significant improvements in Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Specifically for the case of 80 ms, FocalGatedNet achieves a notable MAE reduction of up to 24%, RMSE reduction of up to 14%, and MAPE reduction of up to 36% when compared to Transformer, highlighting its effectiveness in capturing complex knee joint angle patterns. Moreover, FocalGatedNet maintains a lower computational load than most equivalent deep learning models, making it an efficient choice for real-time biomechanical analysis and rehabilitation applications.

dynamic contextual focus, focalgatednet, gated linear unit, (7 more...)

arXiv.org Artificial Intelligence

2306.069

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
Europe > Spain > Castile and León > Segovia Province > Segovia (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

GLU Variants Improve Transformer

Shazeer, Noam

arXiv.org Machine Learning | Feb-12-2020

The Transformer [Vaswani et al., 2017] sequence-to-sequence model alternates between multi-head attention and what it calls "position-wise feed-forward networks" (FFN). The FFN takes a vector $x$ (the hidden representation at a particular position in the sequence) and passes it through two learned linear transformations (represented by the matrices $W_1$ and $W_2$ and bias vectors $b_1$ and $b_2$). A rectified-linear (ReLU) [Glorot et al., 2011] activation function is applied between the two linear transformations: $\mathrm{FFN}(x, W_1, W_2, b_1, b_2) = \max(0, xW_1 + b_1)W_2 + b_2$ (1). Following the T5 codebase [Raffel et al., 2019], we use a version with no bias: $\mathrm{FFN}_{\mathrm{ReLU}}(x, W_1, W_2) = \max(xW_1, 0)W_2$ (2). Subsequent work has proposed replacing the ReLU with other nonlinear activation functions such as Gaussian Error Linear Units, $\mathrm{GELU}(x) = x\Phi(x)$ [Hendrycks and Gimpel, 2016], and $\mathrm{Swish}_\beta(x) = x\sigma(\beta x)$ [Ramachandran et al., 2017].
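
The bias-free FFN in equation (2) can be written out in a few lines of pure Python. The sketch also includes a Swish-gated variant as an example of the kind of GLU replacement the title refers to; that variant, its argument names (`w`, `v`, `w2`), and the toy weights are illustrative assumptions and do not appear in the excerpt above.

```python
import math


def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swish(z, beta=1.0):
    """Swish_beta(z) = z * sigmoid(beta * z)."""
    return z / (1.0 + math.exp(-beta * z))


def ffn_relu(x, w1, w2):
    """Equation (2): max(x W1, 0) W2, with no bias terms."""
    hidden = [max(h, 0.0) for h in matvec(x, w1)]
    return matvec(hidden, w2)


def ffn_swiglu(x, w, v, w2):
    """Illustrative gated variant: (Swish_1(x W) * (x V)) W2, elementwise product."""
    gate = [swish(g) for g in matvec(x, w)]
    value = matvec(x, v)
    hidden = [g * u for g, u in zip(gate, value)]
    return matvec(hidden, w2)


# Toy weights, chosen only to make the arithmetic easy to follow.
x = [1.0, 2.0]
w1 = [[1.0, 0.0, -1.0], [0.5, 1.0, 0.0]]   # 2 x 3
w2 = [[1.0], [1.0], [1.0]]                 # 3 x 1

y_relu = ffn_relu(x, w1, w2)               # x W1 = [2, 2, -1] -> ReLU -> [2, 2, 0]
y_swiglu = ffn_swiglu(x, w1, w1, w2)       # reuses w1 as both gate and value weights
```

The gated variant needs a third weight matrix (`v`) in place of the single `w1`, which is the extra gate/value split that distinguishes a GLU-style FFN from the plain ReLU one.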

activation function, arxiv preprint arxiv, transformer, (14 more...)

arXiv.org Machine Learning

2002.05202

Genre: Research Report (0.41)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
