Theoretical Interpretation of Learned Step Size in Deep-Unfolded Gradient Descent

Takabe, Satoshi, Wadayama, Tadashi

arXiv.org Machine Learning 

Theoretical Interpretation of Learned Step Size in Deep-Unfolded Gradient Descent Satoshi Takabe † and Tadashi Wadayama Nagoya Institute of Technology, Gokiso, Nagoya, Aichi, 466-8555, Japan, {wadayama, s_takabe}@nitech.ac.jp † RIKEN Center for Advanced Intelligence Project, Chuo-ku, Tokyo, 103-0027, Japan Abstract --Deep unfolding is a promising deep-learning technique in which an iterative algorithm is unrolled to a deep network architecture with trainable parameters. In the case of gradient descent algorithms, as a result of the training process, one often observes the acceleration of the convergence speed with learned non-constant step size parameters whose behavior is not intuitive nor interpretable from conventional theory. In this paper, we provide a theoretical interpretation of the learned step size of deep-unfolded gradient descent (DUGD). We first prove that the training process of DUGD reduces not only the mean squared error loss but also the spectral radius related to the convergence rate. Next, we show that minimizing the upper bound of the spectral radius naturally leads to the Chebyshev step which is a sequence of the step size based on Chebyshev polynomials. The numerical experiments confirm that the Chebyshev steps qualitatively reproduce the learned step size parameters in DUGD, which provides a plausible interpretation of the learned parameters. Additionally, we show that the Chebyshev steps achieve the lower bound of the convergence rate for the first-order method in a specific limit without learning parameters or momentum terms. I NTRODUCTION Deep unfolding [10], [12] is a promising deep learning approach whose architecture is based on existing iterative algorithms with tuning parameters such as step sizes in gradient descent (GD). The recursive structure of the algorithm is unrolled to a deep network and some parameters are embedded into the network. These parameters can be trained using standard deep learning techniques such as back propagation and stochastic GD if all the processes in the algorithm are differentiable. One notable advantage of deep unfolding is the acceleration of the convergence speed that results from tuning parameters compared with the original algorithm. Embedding proper trainable parameters also offers a flexible network structure to the algorithm that is applicable, for example, to inverse problems with/without prior information [26]. Recently, theoretical aspects of deep unfolding have also been investigated [5], [21], [23]. MSE performance (upper) and learned step size parameters {γ t} 24 t 0 (lower) of DUGD (circles) and GD with a constant step size (cross marks) when (n,m) (300, 600).

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found