Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis
Hao, Jie, Gong, Xiaochuan, Liu, Mingrui
Bilevel optimization is an important formulation for many machine learning problems, such as meta-learning and hyperparameter optimization. Current bilevel optimization algorithms assume that the gradient of the upper-level function is Lipschitz (i.e., the upper-level function has a bounded smoothness parameter). However, recent studies reveal that certain neural networks, such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), exhibit potentially unbounded smoothness, rendering conventional bilevel optimization algorithms unsuitable for these networks. In this paper, we design a new bilevel optimization algorithm, namely BO-REP, to address this challenge. The algorithm updates the upper-level variable using normalized momentum and incorporates two novel techniques for updating the lower-level variable: initialization refinement and periodic updates. Specifically, once the upper-level variable is initialized, a subroutine is invoked to obtain a refined estimate of the corresponding optimal lower-level variable, and the lower-level variable is thereafter updated only once every fixed period rather than at each iteration. Notably, the resulting oracle complexity matches the state-of-the-art complexity results under the bounded smoothness setting and without mean-squared smoothness of the stochastic gradient, up to logarithmic factors. Our proof relies on novel technical lemmas for the periodically updated lower-level variable, which are of independent interest. Our experiments on hyper-representation learning, hyperparameter optimization, and data hyper-cleaning for text classification tasks demonstrate the effectiveness of the proposed algorithm.

Bilevel optimization refers to an optimization problem where one problem is nested within another (Bracken & McGill, 1973; Dempe, 2002). One important application under this setting is hyper-representation learning with deep neural networks (Franceschi et al., 2018), where x denotes the shared representation …

Table 1: Comparison of oracle complexity of stochastic bilevel algorithms for finding an ϵ-stationary point as defined in Definition 1. The oracle stands for a stochastic gradient or a stochastic Hessian-vector product evaluation.
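The abstract describes the structure of BO-REP at a high level: normalized momentum updates for the upper-level variable, plus initialization refinement and periodic updates for the lower-level variable. Below is a minimal sketch of what such an update loop could look like on a toy quadratic problem; the function names, hyperparameters, and the stand-ins for the hypergradient and lower-level gradient oracles are illustrative assumptions, not the authors' implementation.

```python
# Minimal BO-REP-style update loop, sketched from the algorithm description in the
# abstract above. The toy quadratic oracles and all hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

def upper_grad(x, y):
    # Toy stand-in for the stochastic hypergradient oracle w.r.t. the upper-level variable x.
    return x + A @ y + 0.01 * rng.standard_normal(x.shape)

def lower_grad(x, y):
    # Toy stand-in for the stochastic gradient of the lower-level objective w.r.t. y.
    return y - A.T @ x + 0.01 * rng.standard_normal(y.shape)

def refine_lower(x, y, steps=50, gamma=0.05):
    # "Initialization refinement": once x is initialized, run extra lower-level steps
    # to obtain a refined estimate of the corresponding optimal lower-level variable.
    for _ in range(steps):
        y = y - gamma * lower_grad(x, y)
    return y

x, y = rng.standard_normal(5), rng.standard_normal(5)
m = np.zeros_like(x)                                  # momentum buffer for the upper level
eta, beta, gamma, period, T = 0.01, 0.9, 0.05, 10, 200

y = refine_lower(x, y)                                # refine y for the initial x
for t in range(T):
    m = beta * m + (1 - beta) * upper_grad(x, y)
    x = x - eta * m / (np.linalg.norm(m) + 1e-12)     # normalized momentum step
    if (t + 1) % period == 0:                         # periodic update of y, not every iteration
        for _ in range(5):
            y = y - gamma * lower_grad(x, y)
```

The normalized momentum step keeps the size of each upper-level update bounded regardless of the gradient magnitude, which is presumably why it is used in the unbounded-smoothness setting the paper targets.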
arXiv.org Artificial Intelligence
Jan-17-2024