Analytic expressions for the output evolution of a deep neural network
Anastasia Borovykh
December 19, 2019

Abstract

We present a novel methodology, based on a Taylor expansion of the network output, for obtaining analytical expressions for the expected value of the network weights and output under stochastic training. Using these analytical expressions, we study the effects of the hyperparameters and the noise variance of the optimization algorithm on the performance of the deep neural network. In the early phases of training with a small noise coefficient, the output is equivalent to a linear model. In this case the network can generalize better because the noise prevents the output from fully converging on the training data; however, the noise does not result in any explicit regularization. In the later training stages, when higher-order approximations are required, the impact of the noise becomes more significant: in a model that is nonlinear in the weights, noise can regularize the output function, resulting in better generalization, as witnessed by its influence on the weight Hessian, a commonly used metric for generalization capabilities.

Keywords: deep learning; Taylor expansion; stochastic gradient descent; regularization; generalization

1 Introduction

With the large number of applications that nowadays use deep learning in some way, it is of significant value to gain insight into the output evolution of a deep neural network and the effects that the model architecture and optimization algorithm have on it. A deep neural network is a complex model due to its nonlinear dependencies and its large number of parameters. Understanding the network output and its generalization capabilities, i.e. how well a model optimized on training data will perform on unseen test data, is thus a complex task. One way of gaining insight into the network is to study it in a large-parameter limit, a setting in which its dynamics become analytically tractable. Such limits have been considered in e.g.
The generalization capabilities of a network, and the definition of various quantities that measure them, have been studied extensively. Previous work has shown that the norm [3], [27], [19], the width of a minimum in weight space [11], [34], the input sensitivity [28] and a model's compressibility [2] can be related (either theoretically or in practice) to the model's complexity and thus to its ability to perform well on unseen data. It has also been noted that the generalization capabilities can be influenced by the optimization algorithm used to train the model: the algorithm can bias the model towards configurations that are more robust to noise and have lower model complexity, see e.g. Furthermore, it has been observed that certain parameters of stochastic gradient descent (SGD) can be used to control the generalization error and the data fit, see e.g.
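The abstract's claim that the output behaves like a linear model early in training rests on a first-order Taylor expansion of the output in the weights. The sketch below is only an illustration of that idea, not the paper's construction: the one-hidden-layer tanh network, its size, and the names `mlp`, `grad_w`, `linearized` and `linearization_error` are all assumptions made for this example. It compares the true output at perturbed weights with the expansion around the initial weights `w0`.

```python
import numpy as np

def mlp(w, x, hidden=8):
    """Hypothetical one-hidden-layer tanh network with scalar output.

    w is a flat vector holding W1 (hidden x d, row-major) followed by w2 (hidden).
    """
    d = x.shape[0]
    W1 = w[: hidden * d].reshape(hidden, d)
    w2 = w[hidden * d:]
    return float(w2 @ np.tanh(W1 @ x))

def grad_w(w, x, hidden=8):
    """Analytic gradient of the output with respect to the flat weight vector."""
    d = x.shape[0]
    W1 = w[: hidden * d].reshape(hidden, d)
    w2 = w[hidden * d:]
    h = np.tanh(W1 @ x)
    # d out / d W1[i, j] = w2[i] * (1 - h[i]^2) * x[j];  d out / d w2[i] = h[i]
    dW1 = np.outer(w2 * (1.0 - h ** 2), x)
    return np.concatenate([dW1.ravel(), h])

def linearized(w, w0, x, hidden=8):
    """First-order Taylor expansion of the output around w0 (linear in w)."""
    return mlp(w0, x, hidden) + grad_w(w0, x, hidden) @ (w - w0)

# Assumed toy setup: random input, random initial weights, random unit direction.
rng = np.random.default_rng(0)
d, hidden = 3, 8
x = rng.normal(size=d)
w0 = rng.normal(size=hidden * d + hidden)
direction = rng.normal(size=w0.shape)
direction /= np.linalg.norm(direction)

def linearization_error(eps):
    """Absolute gap between the network and its linearization at w0 + eps * direction."""
    w = w0 + eps * direction
    return abs(mlp(w, x, hidden) - linearized(w, w0, x, hidden))

if __name__ == "__main__":
    for eps in (1e-3, 1e-2, 1e-1):
        print(f"eps={eps:g}  |f - f_lin| = {linearization_error(eps):.3e}")
```

For small weight displacements the gap is governed by the second-order term of the expansion, which is the regime in which a linear model is an adequate description. Larger displacements require the higher-order terms that the abstract associates with later training stages.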
Dec-18-2019