Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling

Mar-27-2025, 14:21:58 GMT–Neural Information Processing Systems

In current deep learning tasks, Adam-style optimizers (such as Adam, Adagrad, RMSprop, Adafactor, and Lion) have been widely used as alternatives to SGD-style optimizers. These optimizers typically update model parameters using the sign of gradients, resulting in more stable convergence curves. The learning rate and the batch size are the most critical hyperparameters for optimizers and require careful tuning to enable effective convergence. Previous research has shown that the optimal learning rate increases linearly (or follows similar rules) with batch size for SGD-style optimizers. However, this conclusion does not apply to Adam-style optimizers.
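
To make the two ideas in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's method: the function names (`linear_scaled_lr`, `sgd_step`, `sign_step`) are hypothetical, and the sign-only update is a deliberate simplification of Adam-style optimizers, which in practice also keep running moment estimates. The linear rule is shown only for the SGD-style case, since the abstract states that it does not carry over to Adam-style optimizers.

```python
def linear_scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule reported for SGD-style optimizers:
    the learning rate grows in proportion to the batch size."""
    return base_lr * batch / base_batch


def sgd_step(params, grads, lr):
    """Plain SGD-style update: step size depends on gradient magnitude."""
    return [p - lr * g for p, g in zip(params, grads)]


def sign_step(params, grads, lr):
    """Simplified sign-based update (assumption: no momentum or
    second-moment terms). Each parameter moves by exactly lr,
    regardless of how large or small its gradient is."""
    def sign(g):
        return (g > 0) - (g < 0)
    return [p - lr * sign(g) for p, g in zip(params, grads)]


if __name__ == "__main__":
    params = [1.0, -2.0, 0.5]
    grads = [10.0, -0.001, 0.0]

    # SGD-style: going from batch 256 to 1024 multiplies the rate by 4.
    print(linear_scaled_lr(0.1, 256, 1024))

    # The SGD step varies with gradient scale; the sign step does not.
    print(sgd_step(params, grads, 0.1))
    print(sign_step(params, grads, 0.1))
```

Because the sign-based step discards gradient magnitude, averaging over a larger batch changes its behavior differently from SGD, which is one intuition for why the linear rule need not hold for Adam-style optimizers.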

large language model, machine learning, natural language

Country:
- Asia > China (0.28)

Genre:
- Research Report
  - Experimental Study (0.93)
  - New Finding (0.93)

Industry:
- Education > Educational Setting > Online (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language > Large Language Model (0.93)
