On the O(√d/K^{1/4}) Convergence Rate of AdamW Measured by ℓ1 Norm
–Neural Information Processing Systems
As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not well understood theoretically.
Jun-22-2026, 14:07:20 GMT