GBA: A Tuning-free Approach to Switch between Synchronous and Asynchronous Training for Recommendation Models

Neural Information Processing Systems 

High-concurrency asynchronous training upon parameter server (PS) architecture and high-performance synchronous training upon all-reduce (AR) architecture are the most commonly deployed distributed training modes for recommendation models. Although synchronous AR training is designed to have higher training efficiency, asynchronous PS training is a better choice for training speed when there are stragglers (slow workers) in the shared cluster, especially under limited computing resources. An ideal way to take full advantage of these two training modes is to switch between them according to the cluster status. However, switching training modes often requires tuning hyper-parameters, which is extremely time- and resource-consuming. We find two obstacles to a tuning-free approach: the different distributions of the gradient values and the stale gradients from the stragglers.
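To make the contrast concrete, the sketch below is a toy illustration (not the paper's GBA method) of the two update rules the abstract refers to: synchronous AR waits for every worker and applies one averaged gradient per step, while asynchronous PS applies each worker's gradient as soon as it arrives, so a straggler's gradient may have been computed against stale weights. All function names and the NumPy-based setup are illustrative assumptions.

```python
import numpy as np


def sync_allreduce_step(w, worker_grads, lr=0.1):
    """Synchronous AR: wait for all workers, average their gradients,
    then apply a single update."""
    g = np.mean(worker_grads, axis=0)  # all-reduce = global average
    return w - lr * g


def async_ps_step(w, grad, lr=0.1):
    """Asynchronous PS: apply one worker's gradient immediately.
    The gradient may be stale, i.e. computed from an older copy of w."""
    return w - lr * grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_sync = np.zeros(4)
    w_async = np.zeros(4)

    # Pretend 3 workers each produce a gradient for the current step.
    grads = [rng.normal(size=4) for _ in range(3)]

    # Sync AR: one update per step with the averaged gradient.
    w_sync = sync_allreduce_step(w_sync, grads)

    # Async PS: three independent updates; a straggler's gradient arrives
    # late and reflects an older version of the weights.
    for g in grads:
        w_async = async_ps_step(w_async, g)

    print("sync AR weights :", w_sync)
    print("async PS weights:", w_async)
```

The sketch also hints at the two obstacles mentioned above: the per-step gradient applied in the synchronous path is an average over workers while the asynchronous path applies raw per-worker gradients (different value distributions), and the asynchronous updates may be computed from outdated weights (staleness).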