Correction of Decoupled Weight Decay

Chou, Jason Chuan-Chih

arXiv.org Artificial Intelligence 

Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set proportional to the learning rate γ without question. On the contrary, we find that eliminating the contribution of the perpendicular component of the update to the weight norm leads to little change in the training dynamics. For gradient methods with momentum or adaptive preconditioning, such as SGD with momentum (Sutskever et al., 2013) and Adam (Kingma & Ba, 2015), weight decay is no longer equivalent to L2 regularization.

Nevertheless, Defazio (2025) presents experiments on the Llama 3 architecture (Grattafiori et al., 2024), in which most layers are not immediately followed by normalization. It states that "we consider every linear layer as normalized, excluding the output layer of the network" for the purpose of applying such corrected weight decay, and AdamC nonetheless results in more stable weight and gradient norms than the AdamW baseline.

Consider the "Renormalized" AdamW optimizer above (Algorithm 1), which eliminates the contribution of the perpendicular component u⊥ of the update to the weight norm. We train a variant of ViT-S/16 based on the setup described in Beyer et al. (2022) on the ImageNet-1k dataset (Russakovsky et al., 2015) for 90 epochs and instead observe almost no differences in relevant metrics (Figure 1).
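To make the "renormalization" idea concrete, the following is a minimal sketch of one plausible implementation: apply the full decoupled-weight-decay update, then rescale the weight vector so that only the component of the update parallel to the weights changes the weight norm. The function name, the decomposition details, and the rescaling rule are our assumptions for illustration, not a reproduction of the paper's Algorithm 1.

```python
import numpy as np

def renormalized_step(w, u, lr, wd, eps=1e-12):
    """Hedged sketch of a 'renormalized' decoupled-weight-decay step.

    w  : flat weight vector
    u  : optimizer update (e.g. the Adam step -lr * m_hat / (sqrt(v_hat)+eps))
    lr : learning rate gamma
    wd : decoupled weight-decay coefficient (AdamW-style, scaled by lr)
    """
    # Decompose the update relative to the current weight direction.
    w_hat = w / max(np.linalg.norm(w), eps)
    u_par = np.dot(u, w_hat) * w_hat  # component of u parallel to w

    # Norm the weights would have if only the parallel component acted,
    # after AdamW-style decoupled decay (1 - lr * wd) * w.
    target = np.linalg.norm((1.0 - lr * wd) * w + u_par)

    # Ordinary decoupled update, then rescale so the perpendicular
    # component u_perp = u - u_par does not change the weight norm.
    w_new = (1.0 - lr * wd) * w + u
    return w_new * (target / max(np.linalg.norm(w_new), eps))
```

For a purely perpendicular update and zero weight decay, the weight norm is left unchanged by construction, which is exactly the property the ablation above probes.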