AITopics | stable-spam

Collaborating Authors

stable-spam

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Minimalist Optimizer Design for LLM Pretraining

Glentis, Athanasios, Li, Jiaxiang, Han, Andi, Hong, Mingyi

arXiv.org Artificial IntelligenceDec-11-2025

Training large language models (LLMs) typically relies on adaptive optimizers such as Adam, which introduce extra operations and require significant more memory to maintain first- and second-order moments than SGD. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed variants to reduce memory consumption, a fundamental question remains: What are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly (memory- and compute-) efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques lead to SCALE (Stochastic Column-normAlized Last-layer momEntum), a simple optimizer for memory efficient pretraining. Across multiple LLaMA models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For LLaMA 7B model, SCALE outperforms the state-of-the-art memory-efficient methods APOLLO and Muon, in terms of both perplexity and memory consumption.

large language model, machine learning, normalization, (20 more...)

arXiv.org Artificial Intelligence

2506.16659

Country:

Asia > Middle East > Jordan (0.05)
North America > United States > Minnesota (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam

Huang, Tianjin, Hu, Haotian, Zhang, Zhenyu, Jin, Gaojie, Li, Xiang, Shen, Li, Chen, Tianlong, Liu, Lu, Wen, Qingsong, Wang, Zhangyang, Liu, Shiwei

arXiv.org Artificial IntelligenceFeb-24-2025

This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but struggles to stabilize gradient norms, requiring careful learning rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical $l_2$-norm statistics; and $(3)$ inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to $2$ perplexity. Furthermore, when both models are trained in 4-bit, Stable-SPAM achieves the same loss as Adam while requiring only about half the training steps. Code is available at https://github.com/TianjinYellow/StableSPAM.git.

large language model, machine learning, stable-spam, (17 more...)

arXiv.org Artificial Intelligence

2502.17055

Country:

Asia > China > Tianjin Province > Tianjin (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > North Carolina (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Add feedback