Gradient Descent Algorithm Survey

Fucheng Deng, Wanjie Wang, Ao Gong, Xiaoqi Wang, Fan Wang

arXiv.org Artificial Intelligence 

Its simple update rule, linear scalability with sample size, and compatibility with momentum, mini-batching, and learning-rate heuristics keep it dominant in both industry and academia. Current research continues to refine convergence rates, variance characterizations, and averaging schemes, while engineering efforts focus on hardware-aligned and distributed variants.

B. Mini-Batch Stochastic Gradient Descent

1) Background and Development: Batch Gradient Descent (BGD) computes the gradient over the entire training set at each iteration. As datasets grow to millions of samples or more, the cost of a single iteration becomes prohibitive, making BGD unsuitable for large-scale learning. The convergence of SGD was established by Robbins and Monro via the stochastic approximation method [1]. SGD updates with a single sample per step, giving low per-iteration cost but high gradient variance and unstable updates. The mini-batch strategy has gradually become the mainstream choice in practice, especially with the rise of large-scale machine learning and deep learning. Bottou emphasized the practical value of mini-batches in his work on large-scale learning [5], and systematic monographs and reviews on deep learning have further standardized the approach [6], [7]. Mini-batch SGD strikes a practical balance among update stability, update frequency, and GPU parallel acceleration [2].
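The mini-batch scheme described above can be sketched as follows. This is a minimal illustrative implementation on synthetic least-squares data, not a reference implementation; the learning rate, batch size, and epoch count are assumptions chosen for the toy problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + small noise
n, d = 1000, 3
w_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)

w = np.zeros(d)
lr, batch_size, epochs = 0.05, 32, 20  # illustrative hyperparameters

for _ in range(epochs):
    perm = rng.permutation(n)              # reshuffle once per epoch
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error over this mini-batch only
        grad = (2.0 / len(idx)) * Xb.T @ (Xb @ w - yb)
        w -= lr * grad                     # SGD update on the mini-batch
```

Averaging the gradient over a batch of 32 samples reduces its variance relative to single-sample SGD, while still performing many more updates per epoch than full-batch gradient descent.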
