Complexity Lower Bounds of Adaptive Gradient Algorithms for Non-convex Stochastic Optimization under Relaxed Smoothness
Michael Crawshaw, Mingrui Liu
arXiv.org Artificial Intelligence
Recent results in non-convex stochastic optimization demonstrate the convergence of popular adaptive algorithms (e.g., AdaGrad) under the $(L_0, L_1)$-smoothness condition, but the rate of convergence is a higher-order polynomial in problem parameters such as the smoothness constants. The complexity guaranteed by such algorithms to find an $ε$-stationary point may therefore be significantly larger than the optimal complexity of $Θ\left( ΔL σ^2 ε^{-4} \right)$ achieved by SGD in the $L$-smooth setting, where $Δ$ is the initial optimality gap and $σ^2$ is the variance of the stochastic gradients. However, it is currently not known whether these higher-order dependencies can be tightened. To answer this question, we investigate complexity lower bounds for several adaptive optimization algorithms in the $(L_0, L_1)$-smooth setting, with a focus on the dependence on the problem parameters $Δ, L_0, L_1$. We provide complexity bounds for three variations of AdaGrad, which show at least a quadratic dependence on the problem parameters $Δ, L_0, L_1$. Notably, we show that the decorrelated variant of AdaGrad-Norm requires at least $Ω\left( Δ^2 L_1^2 σ^2 ε^{-4} \right)$ stochastic gradient queries to find an $ε$-stationary point. We also provide a lower bound for SGD with a broad class of adaptive stepsizes. Our results show that, for certain adaptive algorithms, the $(L_0, L_1)$-smooth setting is fundamentally more difficult than the standard smooth setting, in terms of the initial optimality gap and the smoothness constants.
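For readers unfamiliar with the algorithm the lower bound targets, the following is a minimal sketch of the AdaGrad-Norm update, which scales a single scalar stepsize by the accumulated squared gradient norms. The function names, the 1-D test objective, and the hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import math

def adagrad_norm(grad, x0, eta=1.0, b0=1e-8, steps=200):
    """Sketch of AdaGrad-Norm on a scalar variable.

    Update: x <- x - eta * g / sqrt(b0^2 + sum of ||g_i||^2 so far).
    `grad` is a (possibly stochastic) gradient oracle; here it is
    called deterministically for simplicity.
    """
    x = x0
    accum = b0 ** 2  # running sum of squared gradient norms
    for _ in range(steps):
        g = grad(x)
        accum += g * g
        x -= eta * g / math.sqrt(accum)
    return x

# Illustrative example (assumed, not from the paper):
# minimize f(x) = 0.5 * x^2, whose gradient is x and minimizer is 0.
x_final = adagrad_norm(lambda x: x, x0=5.0)
```

The scalar accumulator makes the effective stepsize decay like $1/\sqrt{t}$ without any knowledge of the smoothness constants, which is precisely the "parameter-free" behavior whose worst-case cost the lower bounds quantify.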
May 9, 2025