AITopics | leakyrelu

leakyrelu

More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

Mingze Wang, Jinbo Wang, Yikuan Xia, Kai Shen, Shu Zhong

arXiv.org Machine Learning, May-27-2026

Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.
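The difference between the input-independent LA mixing and the token-adaptive MoA mixing described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the gate, so a softmax over a linear map of the token (the hypothetical `W_gate`) is assumed here, and the activation dictionary (ReLU, LeakyReLU with slope 0.01, tanh, GELU) and all function names are chosen for illustration. The gated SwiGLU-type variant is not covered.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, slope=0.01):  # slope value is an assumption
    return z if z > 0 else slope * z

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical activation dictionary; the paper's actual choice is not given here.
ACTIVATIONS = [relu, leaky_relu, math.tanh, gelu]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def mix(h, weights):
    # Weighted combination of all activations, applied elementwise to h.
    return [sum(w * act(hj) for w, act in zip(weights, ACTIVATIONS)) for hj in h]

def la_ffn(x, W_in, W_out, coeffs):
    # LA-style: the same mixing coefficients are used for every token.
    return matvec(W_out, mix(matvec(W_in, x), coeffs))

def moa_ffn(x, W_in, W_out, W_gate):
    # MoA-style: mixing weights are computed from the token itself,
    # while W_in and W_out are shared across all activations.
    gates = softmax(matvec(W_gate, x))  # assumed gate form
    return matvec(W_out, mix(matvec(W_in, x), gates)), gates

if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    W_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    for token in ([1.0, 0.0], [0.0, 1.0]):
        y, gates = moa_ffn(token, eye, eye, W_gate)
        print(token, [round(g, 3) for g in gates], [round(v, 3) for v in y])
```

With an all-zero `W_gate` the gates are uniform for every token, so this sketch reduces to the LA case with equal coefficients; a nonzero `W_gate` lets different tokens receive different blends of the same activations.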

large language model, machine learning, tanh, (19 more...)

2605.26647

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Reliable Estimation of KL Divergence using a Discriminator in Reproducing Kernel Hilbert Space: Supplementary Material

Neural Information Processing Systems, Apr-25-2026, 23:05:59 GMT

Organization: This supplementary material follows the structure of the main paper, with matching section numbers and titles. It adds one new section, Section 10, which describes the societal impacts and possible negative impacts of the paper. Theorem numbers likewise match the main paper, and several additional theorems and lemmas that were not included there appear here. GAN-type Objective for KL Estimation: Let f be a discriminator, f: X → ℝ. Let p(x) and q(x) be two probability density functions defined over the space X.
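The excerpt stops before stating the objective, so the paper's estimator and its RKHS construction are not reproduced here. As a hedged illustration of why a discriminator f: X → ℝ is useful at all, the sketch below relies only on a standard identity: if f(x) equals the log density ratio log p(x)/q(x), then KL(p || q) is the expectation of f under p. The two unit-variance Gaussians, the closed-form `log_ratio`, and the helper names are assumptions made for this example.

```python
import random

def log_ratio(x, mu_p=0.0, mu_q=1.0):
    # log p(x)/q(x) for p = N(mu_p, 1) and q = N(mu_q, 1); stands in for
    # an ideal discriminator output (an assumption of this sketch).
    return ((x - mu_q) ** 2 - (x - mu_p) ** 2) / 2.0

def kl_estimate(f, samples_p):
    # Monte Carlo estimate of E_p[f(x)] from samples drawn from p.
    return sum(f(x) for x in samples_p) / len(samples_p)

def demo(n=20000, seed=0):
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # samples from p = N(0, 1)
    return kl_estimate(log_ratio, xs)

if __name__ == "__main__":
    # For these two Gaussians the exact value is (mu_p - mu_q)^2 / 2 = 0.5.
    print(demo())
```

In practice the ratio is unknown and f has to be learned from samples of p and q, which is where the choice of function class for the discriminator matters.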

artificial intelligence, dim, machine learning, (16 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Supplementary Material 1 Derivation of ELBO

Neural Information Processing Systems, Apr-25-2026, 08:11:56 GMT

In this section, we provide a short overview of the definitions relevant to the context of our work. The symmetry of an object is a transformation that leaves some of its properties unchanged.

artificial intelligence, machine learning, sequence, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

151de84cca69258b17375e2f44239191-Supplemental.pdf

Neural Information Processing Systems, Apr-24-2026, 20:16:23 GMT

artificial intelligence, machine learning, pasta-gan, (14 more...)

Genre: Research Report (0.47)

Industry: Information Technology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

f19c44d068fecac1d6d13a80df4f8e96-Supplemental.pdf

Neural Information Processing Systems, Feb-11-2026, 21:08:56 GMT

batchnorm2d, ngf, representation, (16 more...)

Country: Europe > United Kingdom (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.96)

a1263ffa557506ea29c54481788d518f-Supplemental-Conference.pdf

Neural Information Processing Systems, Feb-11-2026, 02:20:41 GMT

gradient, optical encoder, plane, (13 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

952b691c116bf753daafa6ce274e81bb-Supplemental-Conference.pdf

Neural Information Processing Systems, Feb-10-2026, 21:03:54 GMT

activation function, eigenvalue, probability, (16 more...)

Country: North America (0.14)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Bayesian Attention Modules: Appendix A Algorithm

Neural Information Processing Systems, Feb-10-2026, 03:20:01 GMT

Then softmax is applied to obtain probabilities. To tune the hyperparameters in BAM, we randomly hold out 20% of the training set for validation. The vocabulary size V is 9488 and the max caption length T is 16. During training, we use MLE loss only, without scheduled sampling or RL loss. At step j of decoding, the current LSTM state x (a function of the previous target words y_{1:j-1}) is used as the query.

anddmid, artificial intelligence, machine learning, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)
