Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs
Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic Stephan
We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as measured by its information exponent. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned \citep{arous2021online} and $d$ is the input dimension. However, larger batch sizes, $n_b \gg d^{\frac{\ell}{2}}$, are detrimental to the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, \textit{Correlation loss SGD}, which suppresses the auto-correlation terms in the loss function. We further show that the training progress can be tracked by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.
Jun-4-2024
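To make the setup concrete, here is a minimal, self-contained sketch (not the authors' code) of one-pass SGD on a two-layer network learning a single-index target of isotropic Gaussian covariates, with the batch size $n_b$ as a free parameter and a switch between the standard squared loss and a correlation-type loss that drops the auto-correlation term. The specific target function, network width, step size, and dimensions are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: one-pass (online) SGD on a two-layer network learning a
# single-index target of isotropic Gaussian covariates. Each step draws a
# fresh batch of size n_b. The "correlation loss" variant suppresses the
# auto-correlation term of the squared loss, keeping only the -y * f(x) part.
import numpy as np

rng = np.random.default_rng(0)

d, p = 256, 16            # input dimension, hidden width (illustrative)
n_b = d                   # batch size; the paper studies scalings n_b ~ d^{ell/2}
lr = 0.5                  # learning rate (illustrative)
steps = 2000
use_correlation_loss = True

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)          # teacher direction
target = lambda z: z ** 2 - 1             # He_2 target, information exponent ell = 2

W = rng.standard_normal((p, d)) / np.sqrt(d)   # first-layer weights (trained)
a = np.ones(p) / p                              # fixed second layer for simplicity

act = np.tanh
dact = lambda z: 1.0 - np.tanh(z) ** 2

for t in range(steps):
    X = rng.standard_normal((n_b, d))          # fresh samples: one-pass SGD
    y = target(X @ w_star)
    Z = X @ W.T                                # (n_b, p) pre-activations
    y_hat = act(Z) @ a                         # network predictions
    # Squared loss gradient uses (y_hat - y); the correlation loss drops y_hat,
    # i.e., it removes the auto-correlation term and keeps only -y * grad f.
    err = -y if use_correlation_loss else (y_hat - y)
    # Gradient w.r.t. W of (1/n_b) * sum_i err_i * f(x_i)
    G = ((err[:, None] * dact(Z)) * a[None, :]).T @ X / n_b   # shape (p, d)
    W -= lr * G

# Summary statistic in the spirit of the low-dimensional ODE description:
# cosine overlap of each hidden unit with the teacher direction.
overlaps = np.abs(W @ w_star) / np.linalg.norm(W, axis=1)
print("max overlap with the teacher direction:", overlaps.max())
```

In this sketch, increasing n_b (up to the scale $d^{\frac{\ell}{2}}$ discussed in the abstract) lets one take fewer, larger gradient steps for the same total number of samples, which is the time/complexity tradeoff the paper analyzes.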