Ueda, Masahito
Symbolic Equation Solving via Reinforcement Learning
Dabelow, Lennart, Ueda, Masahito
Machine-learning methods are gradually being adopted in a great variety of social, economic, and scientific contexts, yet they are notorious for struggling with exact mathematics. A typical example is computer algebra, which includes tasks like simplifying mathematical terms, calculating formal derivatives, or finding exact solutions of algebraic equations. Traditional software packages for these purposes are commonly based on a huge database of rules for how a specific operation (e.g., differentiation) transforms a certain term (e.g., sine function) into another one (e.g., cosine function). Thus far, these rules have usually needed to be discovered and subsequently programmed by humans. Focusing on the paradigmatic example of solving linear equations in symbolic form, we demonstrate how the process of finding elementary transformation rules and step-by-step solutions can be automated using reinforcement learning with deep neural networks.
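To make the setup concrete, here is a toy sketch (my own illustration of the general idea, not the authors' environment, rules, or reward): a linear equation $ax + b = c$ is represented by its coefficients, the actions are elementary transformations, and a reward signals when the equation has been reduced to solved form. A random policy stands in for the trained agent.

    import random

    ACTIONS = ["sub_b", "div_a"]  # subtract the constant term, divide by the coefficient of x

    def step(state, action):
        a, b, c = state  # encodes the equation a*x + b = c
        if action == "sub_b":
            state = (a, 0.0, c - b)
        elif action == "div_a" and a != 0:
            state = (1.0, b / a, c / a)
        solved = state[0] == 1.0 and state[1] == 0.0  # equation now reads x = c
        reward = 1.0 if solved else -0.1              # small penalty per extra step
        return state, reward, solved

    # Random policy as a stand-in for a policy learned by reinforcement learning.
    state, done, steps = (3.0, 5.0, 11.0), False, 0   # 3x + 5 = 11
    while not done and steps < 10:
        state, _, done = step(state, random.choice(ACTIONS))
        steps += 1
    if done:
        print("solved: x =", state[2], "in", steps, "steps")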
The Probabilistic Stability of Stochastic Gradient Descent
Ziyin, Liu, Li, Botao, Galanti, Tomer, Ueda, Masahito
Characterizing and understanding the stability of Stochastic Gradient Descent (SGD) remains an open problem in deep learning. A common approach is to quantify stability via the convergence of statistical moments, especially the variance, of the parameters. We revisit the definition of stability for SGD and propose using the $\textit{convergence in probability}$ condition to define the $\textit{probabilistic stability}$ of SGD. Probabilistic stability sheds light on a fundamental question in deep learning theory: how SGD selects a meaningful solution for a neural network from an enormous number of possible solutions that may severely overfit. We show that only through the lens of probabilistic stability does SGD exhibit rich and practically relevant phases of learning, such as complete loss of stability, incorrect learning in which the model captures incorrect data correlations, convergence to low-rank saddles, and correct learning in which the model captures the correct correlations. These phase boundaries are precisely quantified by the Lyapunov exponents of the dynamics. The resulting phase diagrams imply that SGD prefers low-rank saddles in a neural network when the underlying gradient is noisy, thereby influencing the learning performance.
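A minimal numerical sketch of the distinction (the toy recursion and its parameters are my own assumptions, not the paper's setup): a linear multiplicative update $x_{t+1} = A_t x_t$, the simplest caricature of SGD linearized around a fixed point, converges to zero in probability whenever the Lyapunov exponent $E[\log|A_t|]$ is negative, even though its second moment diverges because $E[A_t^2] > 1$; moment-based stability and probabilistic stability therefore disagree.

    import numpy as np

    rng = np.random.default_rng(0)
    A_vals = np.array([0.1, 2.0])          # E[log A] < 0 while E[A^2] > 1

    n_runs, T = 10000, 200
    A = rng.choice(A_vals, size=(n_runs, T))
    x = np.cumprod(A, axis=1)              # trajectories x_t starting from x_0 = 1

    print("Lyapunov exponent E[log A]:", np.mean(np.log(A_vals)))    # about -0.80
    print("fraction of runs with |x_T| < 1e-3:", np.mean(np.abs(x[:, -1]) < 1e-3))
    print("E[A^2] =", np.mean(A_vals**2), "(> 1, so E[x_t^2] grows exponentially)")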
Law of Balance and Stationary Distribution of Stochastic Gradient Descent
Ziyin, Liu, Li, Hongchao, Ueda, Masahito
Stochastic gradient descent (SGD) is the standard algorithm for training neural networks. However, it remains poorly understood how SGD navigates the highly nonlinear and degenerate loss landscape of a neural network. In this work, we prove that the minibatch noise of SGD regularizes the parameters towards a balanced solution whenever the loss function contains a rescaling symmetry. Because the difference between a simple diffusion process and SGD dynamics is most significant when symmetries are present, our theory implies that the symmetries of the loss function constitute an essential probe of how SGD works. We then apply this result to derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width. The stationary distribution exhibits complicated nonlinear phenomena such as phase transitions, broken ergodicity, and fluctuation inversion. These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
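A toy numerical sketch (my own setup, not the paper's model or proof): the depth-2 scalar model $f(x) = uwx$ has the rescaling symmetry $(u, w) \to (cu, w/c)$. Gradient flow conserves $u^2 - w^2$ exactly, and full-batch gradient descent stops changing it once the gradient vanishes, whereas minibatch noise keeps shrinking it, driving the factors toward the balanced solution $|u| = |w|$.

    import numpy as np

    rng = np.random.default_rng(1)
    n, lr, batch = 512, 0.1, 8
    x = rng.normal(size=n)
    y = 1.5 * x + 0.5 * rng.normal(size=n)      # label noise keeps SGD noise alive at the minimum

    u, w = 3.0, 0.1                             # deliberately unbalanced initialization
    for t in range(20000):
        idx = rng.integers(0, n, size=batch)
        err = u * w * x[idx] - y[idx]           # minibatch residuals of the model f(x) = u*w*x
        g = np.mean(err * x[idx])               # d(loss)/d(u*w) on the minibatch
        u, w = u - lr * w * g, w - lr * u * g   # (u^2 - w^2) is multiplied by (1 - lr^2 g^2) each step
    print("u*w =", u * w, "   u^2 - w^2 =", u**2 - w**2)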
What shapes the loss landscape of self-supervised learning?
Ziyin, Liu, Lubana, Ekdeep Singh, Ueda, Masahito, Tanaka, Hidenori
Prevention of complete and dimensional collapse of representations has recently become a design principle for self-supervised learning (SSL). However, questions remain in our theoretical understanding: When do those collapses occur? What are the mechanisms and causes? We answer these questions by deriving and thoroughly analyzing an analytically tractable theory of SSL loss landscapes. In this theory, we identify the causes of the dimensional collapse and study the effect of normalization and bias. Finally, we leverage the interpretability afforded by the analytical theory to understand how dimensional collapse can be beneficial and what affects the robustness of SSL against data imbalance.
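A minimal diagnostic sketch (a standard empirical check under a toy setup of my own, not the paper's analytical theory): dimensional collapse shows up as a degenerate eigenvalue spectrum of the covariance of the learned representations, which can be summarized by an effective rank.

    import numpy as np

    def effective_rank(z):
        """Effective rank of representations z with shape (n_samples, dim)."""
        cov = np.cov(z, rowvar=False)
        eig = np.clip(np.linalg.eigvalsh(cov), 1e-12, None)
        p = eig / eig.sum()
        return float(np.exp(-(p * np.log(p)).sum()))   # exponential of the spectral entropy

    rng = np.random.default_rng(0)
    healthy = rng.normal(size=(1000, 64))                       # full-rank features
    collapsed = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 64))  # variance in 3 directions
    print("effective rank, healthy:  ", effective_rank(healthy))
    print("effective rank, collapsed:", effective_rank(collapsed))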
Stochastic Neural Networks with Infinite Width are Deterministic
Ziyin, Liu, Zhang, Hanlin, Meng, Xiangming, Lu, Yuting, Xing, Eric, Ueda, Masahito
Applications of neural networks have achieved great success in various fields. A major extension of the standard neural networks is to make them stochastic, namely, to make the output a random function of the input. In a broad sense, stochastic neural networks include neural networks trained with dropout (Srivastava et al., 2014; Gal & Ghahramani, 2016), Bayesian networks (Mackay, 1992), variational autoencoders (VAE) (Kingma & Welling, 2013), and generative adversarial networks (Goodfellow et al., 2014). There are many reasons why one wants to make a neural network stochastic. Two main reasons are (1) regularization and (2) distribution modeling.
Interplay between depth of neural networks and locality of target functions
Mori, Takashi, Ueda, Masahito
Deep neural networks (DNNs) have achieved unparalleled success in various tasks of artificial intelligence such as image classification [1, 2] and speech recognition [3]. Empirically, DNNs often outperform other machine learning methods such as kernel methods and Gaussian processes, but little is known about the mechanism underlying this outstanding performance. To elucidate the benefits of depth, numerous studies have investigated properties of DNNs from various perspectives. Approximation theory focuses on the expressive power of DNNs [4]. Although the universal approximation theorem states that a sufficiently wide neural network with a single hidden layer can approximate any continuous function, the expressivity of a DNN grows exponentially with depth rather than width [5-8]. In statistical learning theory, the decay rate of the generalization error in the large-sample asymptotics has been analyzed. For learning generic smooth functions, shallow networks or other standard methods with linear estimators, such as kernel methods, already achieve the optimal rate [9], and hence the benefits of depth are not obvious.
A Convergent and Efficient Deep Q Network Algorithm
Wang, Zhikang T., Ueda, Masahito
Despite the empirical success of the deep Q network (DQN) reinforcement learning algorithm and its variants, DQN is still not well understood and does not guarantee convergence. In this work, we show that DQN can diverge and cease to operate in realistic settings. Although there exist gradient-based convergent methods, we show that they actually have inherent problems in learning behaviour and elucidate why they often fail in practice. To overcome these problems, we propose a convergent DQN algorithm (C-DQN) by carefully modifying DQN, and we show that the algorithm is convergent and can work with large discount factors (0.9998). It learns robustly in difficult settings and can learn several difficult games in the Atari 2600 benchmark where DQN fails, within a moderate computational budget. Our code has been publicly released and can be used to reproduce our results.
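For background, a brief sketch (standard textbook updates for a linear Q-function, not the paper's C-DQN modification; the function and variable names are mine) contrasting the semi-gradient update underlying DQN with the residual-gradient update, the kind of gradient-based convergent method referred to above.

    import numpy as np

    def td_error(w, phi_sa, phi_next_best, r, gamma):
        # Bellman error for a linear Q-function Q(s, a) = w . phi(s, a),
        # where phi_next_best are the features of the greedy next action.
        return r + gamma * w @ phi_next_best - w @ phi_sa

    def semi_gradient_step(w, phi_sa, phi_next_best, r, gamma, lr):
        # DQN-style: the bootstrapped target is treated as a constant.
        delta = td_error(w, phi_sa, phi_next_best, r, gamma)
        return w + lr * delta * phi_sa

    def residual_gradient_step(w, phi_sa, phi_next_best, r, gamma, lr):
        # True gradient of the squared Bellman error; convergent but often slow.
        delta = td_error(w, phi_sa, phi_next_best, r, gamma)
        return w + lr * delta * (phi_sa - gamma * phi_next_best)

    # Tiny usage on a single made-up transition.
    w = np.zeros(4)
    phi_sa, phi_next = np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0])
    print(semi_gradient_step(w, phi_sa, phi_next, r=1.0, gamma=0.99, lr=0.1))
    print(residual_gradient_step(w, phi_sa, phi_next, r=1.0, gamma=0.99, lr=0.1))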
SGD May Never Escape Saddle Points
Ziyin, Liu, Li, Botao, Ueda, Masahito
Stochastic gradient descent (SGD) has been deployed to solve highly non-linear and non-convex machine learning problems such as the training of deep neural networks. However, previous works on SGD often rely on highly restrictive and unrealistic assumptions about the nature of noise in SGD. In this work, we mathematically construct examples that defy previous understandings of SGD. For example, our constructions show that: (1) SGD may converge to a local maximum; (2) SGD may escape a saddle point arbitrarily slowly; (3) SGD may prefer sharp minima over flat ones; and (4) AMSGrad may converge to a local maximum. Our results suggest that the noise structure of SGD might be more important than the loss landscape in neural network training and that future research should focus on deriving the actual noise structure in deep learning.
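A deliberately artificial sketch in the spirit of point (1) (my own construction, not one of the paper's examples): for $L(x) = -x^2/2$ the point $x = 0$ is a local maximum, yet SGD driven by an unbiased but multiplicative stochastic gradient converges to it, because the per-step factors contract on average in the logarithm (negative Lyapunov exponent) even though the mean gradient points away.

    import numpy as np

    rng = np.random.default_rng(0)
    lr = 0.2
    z_vals = np.array([-3.0, 5.0])   # E[z] = 1, so the stochastic gradient is unbiased

    x = 1.0
    for t in range(2000):
        grad = -x                    # gradient of L(x) = -x**2 / 2
        g = rng.choice(z_vals) * grad
        x = x - lr * g               # equivalently x *= 1 + lr*z, with E[log|1 + lr*z|] < 0
    print("final |x| (converged to the local maximum at 0):", abs(x))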
Logarithmic landscape and power-law escape rate of SGD
Mori, Takashi, Ziyin, Liu, Liu, Kangqiao, Ueda, Masahito
Stochastic gradient descent (SGD) undergoes complicated multiplicative noise for the mean-square loss. We use this property of the SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a non-uniform transformation of the time variable. In the SDE, the gradient of the loss is replaced by that of the logarithmized loss. Consequently, we show that, near a local or global minimum, the stationary distribution $P_\mathrm{ss}(\theta)$ of the network parameters $\theta$ follows a power law with respect to the loss function $L(\theta)$, i.e., $P_\mathrm{ss}(\theta)\propto L(\theta)^{-\phi}$, with the exponent $\phi$ specified by the mini-batch size, the learning rate, and the Hessian at the minimum. We also obtain a formula for the escape rate from a local minimum, which is determined not by the loss barrier height $\Delta L=L(\theta^s)-L(\theta^*)$ between a minimum $\theta^*$ and a saddle $\theta^s$ but by the logarithmized loss barrier height $\Delta\log L=\log[L(\theta^s)/L(\theta^*)]$. Our escape-rate formula explains the empirical fact that SGD prefers flat minima with low effective dimensions.
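A quick numerical sanity check of the multiplicative-noise premise (the 1-D regression setup and its numbers are my own toy assumptions): for the mean-square loss, the per-sample gradient variance scales with the loss itself, so the ratio of the two stays roughly constant as the parameter moves away from the minimum.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100000
    x = rng.normal(size=n)
    y = 2.0 * x + 0.1 * rng.normal(size=n)   # nearly noiseless labels, minimizer near theta = 2

    def loss_and_grad_var(theta):
        r = theta * x - y                    # residuals
        loss = 0.5 * np.mean(r**2)
        per_sample_grads = r * x             # d/dtheta of 0.5 * r_i**2
        return loss, np.var(per_sample_grads)

    for theta in [2.5, 3.0, 4.0]:
        L, v = loss_and_grad_var(theta)
        print(f"theta={theta}: loss={L:.3f}, per-sample grad variance={v:.3f}, ratio={v / L:.2f}")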
On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes
Ziyin, Liu, Liu, Kangqiao, Mori, Takashi, Ueda, Masahito
The noise in stochastic gradient descent (SGD), caused by minibatch sampling, remains poorly understood despite its enormous practical importance for training efficiency and generalization. In this work, we study the minibatch noise in SGD. Motivated by the observation that minibatch sampling does not always cause a fluctuation, we set out to find the conditions under which minibatch noise emerges. We first derive analytically solvable results for linear regression under various settings, which we compare with the approximations commonly used to understand SGD noise. We show that some degree of mismatch between model and data complexity is needed for minibatch sampling to produce noise, and that such a mismatch may be due to noise in the labels or in the input, the use of regularization, or underparametrization. Our results motivate a more accurate general formulation to describe minibatch noise.
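A toy check of the "no mismatch, no noise" point (my own example, not the paper's derivations): at an interpolating minimum where the model fits every sample exactly, every per-sample gradient vanishes and minibatch sampling produces no noise; adding label noise, one form of model/data mismatch, makes minibatch noise reappear.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    x = rng.normal(size=n)

    def minibatch_grad_var(y, theta, batch=16, draws=5000):
        grads = []
        for _ in range(draws):
            idx = rng.integers(0, n, size=batch)
            r = theta * x[idx] - y[idx]
            grads.append(np.mean(r * x[idx]))   # minibatch gradient of the mean of 0.5*r^2
        return np.var(grads)

    y_clean = 2.0 * x                                   # data generated by the model itself
    y_noisy = 2.0 * x + 0.3 * rng.normal(size=n)        # label noise added
    theta_clean = 2.0                                   # exact interpolating minimizer
    theta_noisy = np.sum(x * y_noisy) / np.sum(x**2)    # least-squares minimizer
    print("minibatch noise at the clean minimum:", minibatch_grad_var(y_clean, theta_clean))
    print("minibatch noise at the noisy minimum:", minibatch_grad_var(y_noisy, theta_noisy))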