Decoupled Relative Learning Rate Schedules

Ludziejewski, Jan, Małaśnicki, Jan, Pióro, Maciej, Krutul, Michał, Ciebiera, Kamil, Stefaniak, Maciej, Krajewski, Jakub, Sankowski, Piotr, Cygan, Marek, Adamczewski, Kamil, Jaszczur, Sebastian

arXiv.org (Artificial Intelligence)

In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across the weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Remarkably, our Relative Learning Rate Schedules (RLRS) method accelerates training by up to $23\%$, particularly in complex models such as Mixture of Experts (MoE). The hyperparameters of RLRS can be efficiently tuned on smaller models and then effectively reused on models up to $27\times$ larger. This simple and effective method yields a substantial reduction in training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.
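To make the core idea concrete, below is a minimal sketch of per-component relative learning rates in PyTorch, using optimizer parameter groups. The component names ("embedding", "attention", "ffn", "lm_head"), the toy model, and the multiplier values are illustrative assumptions for this sketch, not the tuned schedules from the paper; RLRS additionally varies these factors over the course of training.

```python
# Sketch: per-component relative learning rates via PyTorch parameter groups.
# Component names and multiplier values below are hypothetical, chosen only
# to illustrate the mechanism of scaling a base LR per Transformer component.
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """A toy stand-in for a Transformer LM, used only for illustration."""
    def __init__(self, vocab=100, d=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab, d)
        self.attention = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.lm_head = nn.Linear(d, vocab)

def build_param_groups(model, base_lr, relative_lrs, default_factor=1.0):
    """Group parameters by component name and scale the base learning rate
    by a per-component relative factor."""
    groups = {name: [] for name in relative_lrs}
    rest = []
    for param_name, param in model.named_parameters():
        for component in relative_lrs:
            if component in param_name:
                groups[component].append(param)
                break
        else:
            rest.append(param)  # parameters matching no component
    param_groups = [
        {"params": params, "lr": base_lr * relative_lrs[component]}
        for component, params in groups.items() if params
    ]
    if rest:
        param_groups.append({"params": rest, "lr": base_lr * default_factor})
    return param_groups

# Hypothetical multipliers; under RLRS these would be tuned on a small
# model and then reused on a much larger one.
relative_lrs = {"embedding": 1.0, "attention": 0.5, "ffn": 1.5, "lm_head": 0.25}
model = TinyTransformer()
optimizer = torch.optim.AdamW(
    build_param_groups(model, base_lr=3e-4, relative_lrs=relative_lrs)
)
```

Since each parameter group carries its own learning rate, any standard scheduler applied on top scales all components together while their relative factors stay fixed, which is what allows the multipliers tuned on a small model to transfer to larger ones.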
