online stochastic gradient descent
Distributionally Time-Varying Online Stochastic Optimization under Polyak-{\L}ojasiewicz Condition with Application in Conditional Value-at-Risk Statistical Learning
Pun, Yuen-Man, Farokhi, Farhad, Shames, Iman
In this work, we consider a sequence of stochastic optimization problems following a time-varying distribution via the lens of online optimization. Assuming that the loss function satisfies the Polyak-{\L}ojasiewicz condition, we apply online stochastic gradient descent and establish its dynamic regret bound that is composed of cumulative distribution drifts and cumulative gradient biases caused by stochasticity. The distribution metric we adopt here is Wasserstein distance, which is well-defined without the absolute continuity assumption or with a time-varying support set. We also establish a regret bound of online stochastic proximal gradient descent when the objective function is regularized. Moreover, we show that the above framework can be applied to the Conditional Value-at-Risk (CVaR) learning problem. Particularly, we improve an existing proof on the discovery of the PL condition of the CVaR problem, resulting in a regret bound of online stochastic gradient descent.
Poor starting points in machine learning
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. A good option is to roll off Nesterov acceleration for later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.