Gradient Descent Algorithm Survey
Fucheng Deng, Wanjie Wang, Ao Gong, Xiaoqi Wang, Fan Wang
arXiv.org Artificial Intelligence
Its simple update rule, linear scalability in sample size, and compatibility with momentum, mini-batching, and learning-rate heuristics keep it dominant in both industry and academia. Current research continues to refine convergence rates, variance characterizations, and averaging schemes, while engineering efforts focus on hardware-aligned and distributed variants.

B. Mini-Batch Stochastic Gradient Descent

1) Background and Development: Batch Gradient Descent (BGD) computes the gradient over the entire training set at each iteration. As datasets grow to millions of samples or more, the cost of a single iteration becomes prohibitive, making BGD unsuitable for large-scale learning. The convergence of SGD was established by Robbins and Monro via the stochastic approximation method [1]. SGD updates with a single sample per step, which is computationally cheap but yields high gradient variance and unstable updates. The mini-batch strategy has gradually become standard practice, especially with the rise of large-scale machine learning and deep learning. Bottou emphasized the practical value of mini-batches in his work on large-scale learning [5], and systematic monographs and surveys on deep learning have further standardized the approach [6], [7]. Mini-batch SGD strikes a balance between update stability, update frequency, and GPU-parallel acceleration [2].
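The trade-off described above can be sketched in a few lines: instead of one full-dataset gradient per step (BGD) or one sample per step (plain SGD), the data are shuffled each epoch and consumed in small batches. The code below is a minimal illustrative sketch on a least-squares problem; the function name, batch size, and learning rate are illustrative choices, not taken from the survey.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=32, epochs=50, seed=0):
    """Mini-batch SGD for least squares: min_w (1/2n) * ||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)                 # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]  # one mini-batch of indices
            Xb, yb = X[idx], y[idx]
            # Gradient estimated on the mini-batch only, averaging over its size
            grad = Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad
    return w

# Synthetic demo: recover a known weight vector from noisy observations.
rng = np.random.default_rng(1)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))
y = X @ true_w + 0.01 * rng.normal(size=1000)
w_hat = minibatch_sgd(X, y)
```

With `batch_size=1` this reduces to plain SGD (cheap steps, high variance); with `batch_size=n` it reduces to BGD (stable but expensive steps). Intermediate batch sizes average several per-sample gradients per step, lowering variance while keeping each update cheap and amenable to vectorized/GPU execution.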
Nov-27-2025