Cautious Weight Decay
Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
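The masking rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the base optimizer produces an update `u` that is applied as `x <- x - lr * u`, and that "signs align" means `x * u > 0` for a given coordinate; the handling of exact zeros and the function name `cautious_decay_step` are illustrative assumptions.

```python
def cautious_decay_step(params, updates, lr=0.1, weight_decay=0.1):
    """One parameter step with sign-masked (cautious) weight decay.

    Assumptions (not taken from the source): `updates` holds the base
    optimizer's direction u, applied as x <- x - lr * u, and decay is
    applied only on coordinates where x * u > 0.
    """
    new_params = []
    for x, u in zip(params, updates):
        # Mask is 1 where the parameter and the update share a sign, else 0.
        mask = 1.0 if x * u > 0 else 0.0
        # Standard decoupled decay would use mask = 1.0 on every coordinate.
        new_params.append(x - lr * (u + weight_decay * mask * x))
    return new_params


params = [1.0, -2.0, 3.0]
updates = [0.5, 0.5, -1.0]
stepped = cautious_decay_step(params, updates)
```

In this toy example only the first coordinate is decayed, because it is the only one where the parameter and the update have the same sign; the other two receive the plain optimizer step.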
Oct-15-2025
- Country:
- North America > United States
- Texas > Travis County
- Austin (0.04)
- California > San Francisco County
- San Francisco (0.14)
- Asia
- Middle East > Jordan (0.04)
- China > Tianjin Province
- Tianjin (0.04)
- Genre:
- Research Report (0.64)