Reviews: Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes
–Neural Information Processing Systems
This paper identifies and separates (kernel) linear least-squares regression problems wherein carrying out multiple passes of stochastic gradient descent (SGD) over a training set can yield better statistical error than only a single pass. This is relevant to the core of machine learning theory, and relates to a line of work published at NIPS, ICML, COLT, and similar conferences in the past several years about the statistical error of one-pass, many-pass, and ERM-based learning. The authors focus on regression problems captured, by assumption, by two parameters: alpha, which governs the exponent of a power-law eigenvalue decay, and r, which governs a transformation under which the Hilbert norm of the optimal predictor is bounded. They refer to problems where r (alpha - 1) / (2 * alpha) as "hard". The main result of the paper is to show that for these "hard" problems, multiple SGD passes either achieve (minimax) optimal rates of statistical estimation, or at least improve the rate relative to a single pass. The results are interesting and might address an unanswered core question in machine learning, and the mathematical presentation is clear, with assumptions upfront.
Neural Information Processing Systems
Oct-7-2024, 05:11:12 GMT