Goto

Collaborating Authors

 Gradient Descent




An Accelerated Algorithm for Stochastic Bilevel Optimization under Unbounded Smoothness

Neural Information Processing Systems

However, it remains unclear if we can further improve the convergence rate when the assumptions for the function in the population level also hold for each random realization almost surely (e.g., Lipschitzness of each realization of the stochastic gradient).


Small steps no more: Global convergence of stochastic gradient bandits for arbitrary learning rates

Neural Information Processing Systems

In particular, we establish the surprising result that: F or any constant learning rate ฮท > 0, the stochastic gradient bandit algorithm is guaranteed to converge to the globally optimal policy almost surely.



Stable Nonconvex-Nonconcave Training via Linear Interpolation

Neural Information Processing Systems

By replacing the inner optimizer in RAPP we rediscover the family of Lookahead algorithms for which we establish convergence in cohypomonotone problems even when the base optimizer is taken to be gradient descent ascent.




Transformers learn to implement preconditioned gradient descent for in-context learning

Neural Information Processing Systems

Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent.