linear network
- Europe > Switzerland > Basel-City > Basel (0.04)
- Asia > Middle East > Israel (0.04)
- North America > United States (0.14)
- Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
The Impact of Anisotropic Covariance Structure on the Training Dynamics and Generalization Error of Linear Networks
Watanabe, Taishi, Karakida, Ryo, Teramae, Jun-nosuke
The success of deep neural networks largely depends on the st atistical structure of the training data. While learning dynamics and generalization on iso tropic data are well-established, the impact of pronounced anisotropy on these crucial aspect s is not yet fully understood. We examine the impact of data anisotropy, represented by a sp iked covariance structure, a canonical yet tractable model, on the learning dynamics and generalization error of a two-layer linear network in a linear regression setting. Our ana lysis reveals that the learning dynamics proceed in two distinct phases, governed initiall y by the input-output correlation and subsequently by other principal directions of the data s tructure. Furthermore, we derive an analytical expression for the generalization error, quantifying how the alignment of the spike structure of the data with the learning task improv es performance. Our findings offer deep theoretical insights into how data anisotropy sha pes the learning trajectory and final performance, providing a foundation for understandin g complex interactions in more advanced network architectures.
- North America > United States (0.14)
- Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
Implicit Bias of (Stochastic) Gradient Descent for Rank-1 Linear Neural Network
Studying the implicit bias of gradient descent (GD) and stochastic gradient descent (SGD) is critical to unveil the underlying mechanism of deep learning. Unfortunately, even for standard linear networks in regression setting, a comprehensive characterization of the implicit bias is still an open problem. This paper proposes to investigate a new proxy model of standard linear network, rank-1 linear network, where each weight matrix is parameterized as a rank-1 form. For over-parameterized regression problem, we precisely analyze the implicit bias of GD and SGD---by identifying a "potential" function such that GD converges to its minimizer constrained by zero training error (i.e., interpolation solution), and further characterizing the role of the noise introduced by SGD in perturbing the form of this potential. Our results explicitly connect the depth of the network and the initialization with the implicit bias of GD and SGD. Furthermore, we emphasize a new implicit bias of SGD jointly induced by stochasticity and over-parameterization, which can reduce the dependence of the SGD's solution on the initialization. Our findings regarding the implicit bias are different from that of a recently popular model, the diagonal linear network. We highlight that the induced bias of our rank-1 model is more consistent with standard linear network while the diagonal one is not. This suggests that the proposed rank-1 linear network might be a plausible proxy for standard linear net.
Deep linear networks for regression are implicitly regularized towards flat minima
The largest eigenvalue of the Hessian, or sharpness, of neural networks is a key quantity to understand their optimization dynamics. In this paper, we study the sharpness of deep linear networks for univariate regression. Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one. Indeed, we show a lower bound on the sharpness of minimizers, which grows linearly with depth. We then study the properties of the minimizer found by gradient flow, which is the limit of gradient descent with vanishing learning rate.
Exact marginal prior distributions of finite Bayesian neural networks
Bayesian neural networks are theoretically well-understood only in the infinite-width limit, where Gaussian priors over network weights yield Gaussian priors over network outputs. Recent work has suggested that finite Bayesian networks may outperform their infinite counterparts, but their non-Gaussian output priors have been characterized only though perturbative approaches. Here, we derive exact solutions for the function space priors for individual input examples of a class of finite fully-connected feedforward Bayesian neural networks. For deep linear networks, the prior has a simple expression in terms of the Meijer $G$-function. The prior of a finite ReLU network is a mixture of the priors of linear networks of smaller widths, corresponding to different numbers of active units in each layer. Our results unify previous descriptions of finite network priors in terms of their tail decay and large-width behavior.
A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point
Couto, Carlos, Mourão, José, Figueiredo, Mário A. T., Ribeiro, Pedro
Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for some classes of teacher-student problems, when the teacher and student networks have matching weights, showing that the smaller eigenvalues of the Hessian determine long-time learning performance. For linear networks, we analytically establish that for large networks the spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. We numerically analyse the Hessian spectrum for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be seen as an effective number of parameters for networks using polynomial activation functions. For a generic non-linear activation function, such as the error function, we empirically observe that the Hessian matrix is always full rank.
- Europe > Portugal > Lisbon > Lisbon (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)