Data-Dependence of Plateau Phenomenon in Learning with Neural Network --- Statistical Mechanical Analysis
The plateau phenomenon, wherein the loss value stops decreasing during learning, has been reported by various researchers. The phenomenon was actively investigated in the 1990s and attributed to the fundamental hierarchical structure of neural network models; since then, it has been regarded as inevitable. However, the phenomenon seldom occurs in recent deep learning, leaving a gap between theory and practice. In this paper, using a statistical mechanical formulation, we clarify the relationship between the plateau phenomenon and the statistical properties of the learned data. We show that data whose covariance has small and dispersed eigenvalues tend to make the plateau phenomenon inconspicuous.
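The data-dependence described in the abstract can be illustrated with a toy teacher–student simulation (a minimal sketch, not the paper's statistical mechanical formulation; the network sizes, tanh activation, learning rate, and the particular eigenvalue spectra are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # Smooth sigmoidal activation, standing in for the erf-type
    # units used in Saad-Solla-style analyses (an assumption here).
    return np.tanh(x)

def run(eigvals, n_steps=20000, eta=0.05, N=100, K=2, M=2):
    """Online SGD for a soft-committee student learning a fixed teacher.

    `eigvals` prescribes the spectrum of the (diagonal) input covariance,
    repeated to length N, so different calls probe different data statistics.
    """
    lam = np.resize(np.asarray(eigvals, float), N)
    sqrt_cov = np.sqrt(lam)                      # x ~ N(0, diag(lam))
    B = rng.normal(size=(M, N)) / np.sqrt(N)     # teacher weights (fixed)
    J = rng.normal(size=(K, N)) * 1e-3           # student starts near zero
    losses = []
    for _ in range(n_steps):
        x = rng.normal(size=N) * sqrt_cov
        y = g(B @ x).sum()                       # teacher output
        s = g(J @ x).sum()                       # student output
        err = s - y
        losses.append(0.5 * err**2)
        # Online SGD step on the per-example squared error.
        J -= eta * err * (1 - g(J @ x)**2)[:, None] * x[None, :]
    return np.array(losses)

loss_iso = run([1.0])          # isotropic inputs
loss_aniso = run([2.0, 0.1])   # anisotropic spectrum with small eigenvalues
```

Plotting the two loss curves (e.g. a smoothed log-log plot) lets one compare how flat the intermediate phase of learning looks under the two spectra; the sketch only demonstrates the simulation setup, not the paper's quantitative conclusions.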
Reviews: Data-Dependence of Plateau Phenomenon in Learning with Neural Network --- Statistical Mechanical Analysis
It would make more sense to show results for data with low-dimensional structure, in which the first one or two eigenvalues are non-zero and the rest are either zero or epsilon-small. Do the conclusions for the two-eigenvalue case still hold in this example? It is hard for me to see what I should learn from Figures 5 and 6. - The dependence of the learning dynamics on the spectral properties of the input data is not new and was previously studied by Saxe et al. (arXiv, 2013) for simple linear networks. It would be appropriate if these results were mentioned or discussed in the text. - It has been previously shown that the initial conditions have a big impact on the trainability and learning dynamics of these networks. In this case, they would be defined as the initial conditions on the order parameters Q, R, and D. - The analysis here seems tractable only for networks with a small number of hidden units.
This paper provides an analysis of the dynamics of online learning in two-layer neural networks under the teacher-student scenario. The analysis extends that of Saad and Solla (1995) by considering a covariance matrix of the input which may not be proportional to the identity matrix. The main contribution of this paper is the finding that the plateau phenomenon observed in the learning dynamics of nonlinear neural networks depends on the statistics of the input data. The three reviewers rated this paper above the acceptance threshold, mentioning the originality and importance of its contribution. At the same time, two reviewers raised concerns about the clarity of the presentation.
Noise-induced degeneration in online learning
Sato, Yuzuru, Tsutsui, Daiji, Fujiwara, Akio
Gradient descent is the simplest optimisation algorithm, represented by gradient dynamics in a potential. When the input data are finite, the gradient descent dynamics fluctuate due to finite-size effects, and the resulting method is called stochastic gradient descent. In this paper, we study the stability of stochastic gradient descent dynamics from the viewpoint of dynamical systems theory. Learning is characterised as nonautonomous dynamics driven by uncertain input from the external environment, and as multi-scale dynamics consisting of slow memory dynamics and fast system dynamics. When the uncertain input sequences are modelled by stochastic processes, the dynamics of learning are described by a random dynamical system. In contrast to the traditional Fokker-Planck approaches [5, 15], the random dynamical system approach enables the study not only of stationary distributions and global statistics, but also of the pathwise structure of stochastic dynamics. Based on nonautonomous and random dynamical systems theory, it is possible to analyse stability and bifurcation in machine learning.
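The pathwise view described in the abstract can be sketched with a toy example (the double-well potential, noise amplitude, and step size here are illustrative assumptions, not the authors' model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy potential V(w) = (w^2 - 1)^2 / 4, with gradient w * (w^2 - 1).
def grad_V(w):
    return w * (w**2 - 1.0)

def sgd_path(w0, eta=0.05, noise=0.0, n_steps=5000):
    """One pathwise realization of noisy gradient dynamics: gradient
    descent on V driven by an i.i.d. noise sequence standing in for
    finite-sample fluctuations of the stochastic gradient."""
    w = w0
    path = np.empty(n_steps)
    for t in range(n_steps):
        xi = rng.normal()                      # uncertain external input
        w = w - eta * (grad_V(w) + noise * xi)
        path[t] = w
    return path

# The deterministic path settles into the minimum of the basin it
# starts in; the stochastic path keeps fluctuating around (and may
# escape) a well -- a pathwise property that stationary distributions
# alone do not reveal.
det = sgd_path(0.5, noise=0.0)
sto = sgd_path(0.5, noise=0.5)
```

Comparing many independent realizations of `sto` (rather than a single density over `w`) is the spirit of the random dynamical system viewpoint sketched here.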
Plateau Phenomenon in Gradient Descent Training of ReLU networks: Explanation, Quantification and Avoidance
Ainsworth, Mark, Shin, Yeonjong
The ability of neural networks to provide `best in class' approximation across a wide range of applications is well-documented. Nevertheless, the powerful expressivity of neural networks comes to naught if one is unable to effectively train (choose) the parameters defining the network. In general, neural networks are trained by gradient descent type optimization methods, or a stochastic variant thereof. In practice, such methods cause the loss function to decrease rapidly at the beginning of training but then, after a relatively small number of steps, to slow down significantly. The loss may even appear to stagnate over a large number of epochs, only to suddenly start decreasing rapidly again for no apparent reason. This so-called plateau phenomenon manifests itself in many learning tasks. The present work aims to identify and quantify the root causes of the plateau phenomenon. No assumptions are made on the number of neurons relative to the number of training data, and our results hold for both the lazy and adaptive regimes. The main findings are: plateaux correspond to periods during which activation patterns remain constant, where the activation pattern refers to the number of data points that activate a given neuron; quantification of the convergence of the gradient flow dynamics; and characterization of stationary points in terms of solutions of local least squares regression lines over subsets of the training data. Based on these conclusions, we propose a new iterative training method, Active Neuron Least Squares (ANLS), characterised by the explicit adjustment of the activation pattern at each step, which is designed to enable a quick exit from a plateau. Illustrative numerical examples are included throughout.
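The notion of an activation pattern, and how one would monitor its constancy along training, can be illustrated with a minimal full-batch gradient descent sketch (network size, data, and learning rate are illustrative assumptions; this is not the authors' ANLS method):

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny one-hidden-layer ReLU network fit to |x| by full-batch GD.
X = rng.normal(size=(20, 1))                 # 20 training points
Y = np.abs(X[:, 0])                          # target values
W = rng.normal(size=(4, 1)); b = rng.normal(size=(4,))
a = rng.normal(size=(4,)) * 0.1

def activation_pattern():
    """Number of training points activating each neuron -- the quantity
    whose constancy characterizes a plateau in the abstract above."""
    return tuple(((X @ W.T + b) > 0).sum(axis=0))

eta = 0.01
patterns, losses = [], []
for t in range(2000):
    H = np.maximum(X @ W.T + b, 0.0)         # ReLU hidden layer
    r = H @ a - Y                            # residual
    losses.append(0.5 * np.mean(r**2))
    # Gradients of the mean squared error.
    ga = H.T @ r / len(X)
    gH = np.outer(r, a) * (H > 0)            # backprop through ReLU mask
    gW = gH.T @ X / len(X)
    gb = gH.mean(axis=0) 
    a -= eta * ga; W -= eta * gW; b -= eta * gb
    patterns.append(activation_pattern())

# Count steps on which the activation pattern did not change; long
# unbroken runs of identical patterns are the plateau signature.
constant_steps = sum(p1 == p2 for p1, p2 in zip(patterns, patterns[1:]))
```

Inspecting where `patterns` changes against the `losses` curve shows whether stagnant stretches of the loss line up with frozen activation patterns, which is the diagnostic the abstract describes.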