AITopics | ofs

Understanding and Minimising Outlier Features in Transformer Training

Neural Information Processing SystemsMar-21-2026, 17:36:47 GMT

Outlier Features (OFs) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models. Despite their practical importance, little is known behind why OFs emerge during training, nor how one can minimise them.Our work focuses on the above questions, first identifying several quantitative metrics, such as the kurtosis over neuron activation norms, to measure OFs. With these metrics, we study how architectural and optimisation choices influence OFs, and provide practical insights to minimise OFs during training. As highlights, we introduce a novel unnormalised transformer block, the Outlier Protected block, and present a previously unknown benefit of non-diagonal preconditioning optimisers, finding both approaches to significantly reduce OFs and improve quantisation without compromising convergence speed, at scales of up to 7B parameters. Notably, our combination of OP block and non-diagonal preconditioner (SOAP) achieves 14.87 weight-and-activation int8 perplexity (from 14.71 in standard precision), compared to 63.4 int8 perplexity (from 16.00) with a default OF-prone combination of Pre-Norm model and Adam, when quantising OPT-125m models post-training.

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Add feedback

986292a930c3692168b177a770025ab3-Paper-Conference.pdf

Neural Information Processing SystemsFeb-16-2026, 20:53:28 GMT

large language model, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > Dominican Republic (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Energy (0.46)
Education > Educational Setting (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

Add feedback

Understanding and Minimising Outlier Features in Transformer Training Bobby He

Neural Information Processing SystemsOct-10-2025, 10:39:23 GMT

Despite their widespread use, our understanding of deep neural networks (NNs) and their training dynamics is very much incomplete.

international conference, kurtosis, ofs, (11 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > Dominican Republic (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Energy (0.46)
Education > Educational Setting (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Add feedback

Understanding and Minimising Outlier Features in Transformer Training

Neural Information Processing SystemsMay-27-2025, 09:57:41 GMT

Outlier Features (OFs) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models. Despite their practical importance, little is known behind why OFs emerge during training, nor how one can minimise them.Our work focuses on the above questions, first identifying several quantitative metrics, such as the kurtosis over neuron activation norms, to measure OFs. With these metrics, we study how architectural and optimisation choices influence OFs, and provide practical insights to minimise OFs during training. As highlights, we introduce a novel unnormalised transformer block, the Outlier Protected block, and present a previously unknown benefit of non-diagonal preconditioning optimisers, finding both approaches to significantly reduce OFs and improve quantisation without compromising convergence speed, at scales of up to 7B parameters.

minimising outlier feature, ofs, transformer training, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.42)

Add feedback

Understanding and Minimising Outlier Features in Neural Network Training

He, Bobby, Noci, Lorenzo, Paliotta, Daniele, Schlag, Imanol, Hofmann, Thomas

arXiv.org Artificial IntelligenceMay-29-2024

Outlier Features (OF) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models. Despite their practical importance, little is known behind why OFs emerge during training, nor how one can minimise them. Our work focuses on the above questions, first identifying several quantitative metrics, such as the kurtosis over neuron activation norms, to measure OFs. With these metrics, we study how architectural and optimisation choices influence OFs, and provide practical insights to minimise OFs during training. As highlights, we emphasise the importance of controlling signal propagation throughout training, and propose the Outlier Protected transformer block, which removes standard Pre-Norm layers to mitigate OFs, without loss of convergence speed or training stability. Overall, our findings shed new light on our understanding of, our ability to prevent, and the complexity of this important facet in NN training dynamics.

international conference, kurtosis, ofs, (10 more...)

arXiv.org Artificial Intelligence

2405.19279

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > Dominican Republic (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Energy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Granularity-Adaptive Proof Presentation

Schiller, Marvin, Benzmueller, Christoph

arXiv.org Artificial IntelligenceMay-25-2009

When mathematicians present proofs they usually adapt their explanations to their didactic goals and to the (assumed) knowledge of their addressees. Modern automated theorem provers, in contrast, present proofs usually at a fixed level of detail (also called granularity). Often these presentations are neither intended nor suitable for human use. A challenge therefore is to develop user- and goal-adaptive proof presentation techniques that obey common mathematical practice. We present a flexible and adaptive approach to proof presentation that exploits machine learning techniques to extract a model of the specific granularity of proof examples and employs this model for the automated generation of further proofs at an adapted level of granularity.

logic & formal reasoning, machine learning, presen tation, (20 more...)

arXiv.org Artificial Intelligence

0903.0314

Country: