Correction of Decoupled Weight Decay

Chou, Jason Chuan-Chih

arXiv.org Artificial Intelligence 

Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set proportional to the learning rate γ without question. On the contrary, we find that eliminating the contribution of the perpendicular component of the update to the weight norm leads to little change in the training dynamics. For gradient methods with momentum or adaptive preconditioning, such as SGD with momentum (Sutskever et al., 2013) and Adam (Kingma & Ba, 2015), weight decay is no longer equivalent to L2 regularization.

Nevertheless, Defazio (2025) presents experiments on the Llama 3 architecture (Grattafiori et al., 2024), in which most layers are not immediately followed by normalization. It states that "we consider every linear layer as normalized, excluding the output layer of the network" for the purpose of applying such corrected weight decay, and AdamC nonetheless results in more stable weight and gradient norms than the AdamW baseline.

Consider the "Renormalized" AdamW optimizer above (Algorithm 1), which eliminates the contribution of the perpendicular component u⊥ of the update to the weight norm. We train a variant of ViT-S/16 based on the setup described in Beyer et al. (2022) on the ImageNet-1k dataset (Russakovsky et al., 2015) for 90 epochs and instead observe almost no differences in relevant metrics (Figure 1).
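To make the "renormalization" idea concrete, the following is a minimal sketch of one plausible implementation: apply the full decoupled-weight-decay update, then rescale the weight vector so that only the component of the update parallel to the weights changes the weight norm. The function name, the decomposition details, and the rescaling rule are our assumptions for illustration, not a reproduction of the paper's Algorithm 1.

```python
import numpy as np

def renormalized_step(w, u, lr, wd, eps=1e-12):
    """Hedged sketch of a 'renormalized' decoupled-weight-decay step.

    w  : flat weight vector
    u  : optimizer update (e.g. the Adam step -lr * m_hat / (sqrt(v_hat)+eps))
    lr : learning rate gamma
    wd : decoupled weight-decay coefficient (AdamW-style, scaled by lr)
    """
    # Decompose the update relative to the current weight direction.
    w_hat = w / max(np.linalg.norm(w), eps)
    u_par = np.dot(u, w_hat) * w_hat  # component of u parallel to w

    # Norm the weights would have if only the parallel component acted,
    # after AdamW-style decoupled decay (1 - lr * wd) * w.
    target = np.linalg.norm((1.0 - lr * wd) * w + u_par)

    # Ordinary decoupled update, then rescale so the perpendicular
    # component u_perp = u - u_par does not change the weight norm.
    w_new = (1.0 - lr * wd) * w + u
    return w_new * (target / max(np.linalg.norm(w_new), eps))
```

For a purely perpendicular update and zero weight decay, the weight norm is left unchanged by construction, which is exactly the property the ablation above probes.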