linear convolutional network
Implicit Bias of Gradient Descent on Linear Convolutional Networks
We show that gradient descent on full-width linear convolutional networks of depth $L$ converges to a linear predictor related to the $\ell_{2/L}$ bridge penalty in the frequency domain. This is in contrast to linear fully connected networks, where gradient descent converges to the hard-margin linear SVM solution, regardless of depth.
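To make the phrase "$\ell_{2/L}$ bridge penalty in the frequency domain" concrete, the following is a minimal sketch that evaluates $\sum_k |\hat\beta_k|^{2/L}$ for the discrete Fourier transform $\hat\beta$ of a predictor $\beta$. The function name `fourier_bridge_penalty`, the unitary DFT normalization, and the two test vectors are assumptions made for illustration; the abstract does not fix any of them.

```python
import numpy as np

def fourier_bridge_penalty(beta, depth):
    """Return sum_k |beta_hat[k]|^(2/depth) for the unitary DFT beta_hat of beta."""
    # Assumption: unitary DFT normalization, so depth=1 reduces to the squared l2 norm.
    beta_hat = np.fft.fft(np.asarray(beta, dtype=float), norm="ortho")
    return float(np.sum(np.abs(beta_hat) ** (2.0 / depth)))

D = 8
# Illustrative unit-norm predictors (hypothetical, not taken from the paper).
sparse_spectrum = np.cos(2 * np.pi * np.arange(D) / D)  # a cosine occupies two DFT bins
sparse_spectrum /= np.linalg.norm(sparse_spectrum)
flat_spectrum = np.zeros(D)
flat_spectrum[0] = 1.0  # a delta has equal magnitude at every frequency

for depth in (1, 2, 4):
    print(depth,
          fourier_bridge_penalty(sparse_spectrum, depth),
          fourier_bridge_penalty(flat_spectrum, depth))
```

Under these assumptions the two vectors have the same penalty at depth 1, where the penalty is the squared $\ell_2$ norm, while for larger depth the vector with few active frequencies has the smaller penalty. This is the sense in which deeper networks favor predictors that are sparse in the frequency domain.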
The Riemannian Geometry associated to Gradient Flows of Linear Convolutional Networks
Achour, El Mehdi, Kohn, Kathlén, Rauhut, Holger
We study geometric properties of the gradient flow for learning deep linear convolutional networks. For linear fully connected networks, it has been shown recently that the corresponding gradient flow on parameter space can be written as a Riemannian gradient flow on function space (i.e., on the product of weight matrices) if the initialization satisfies a so-called balancedness condition. We establish that the gradient flow on parameter space for learning linear convolutional networks can be written as a Riemannian gradient flow on function space regardless of the initialization. This result holds for $D$-dimensional convolutions with $D \geq 2$, and for $D =1$ it holds if all so-called strides of the convolutions are greater than one. The corresponding Riemannian metric depends on the initialization.
Reviews: Implicit Bias of Gradient Descent on Linear Convolutional Networks
The paper considers the problem of formalizing the implicit bias of gradient descent on fully connected linear/convolutional networks with an exponential loss. Building on the recent work by Soudry et al., which considered a one-layer neural network with no activation, the paper generalizes the analysis to deeper networks (still with no activations) trained with the exponential loss. The two main networks considered by the authors and the corresponding results are as follows. Linear Fully Connected Networks - In this setting the authors show that gradient descent converges, in the limit, to a predictor whose direction is that of the max-margin predictor. This behaviour is the same as what was established in the earlier paper of Soudry et al. for one-layer neural networks.
Algebraic Complexity and Neurovariety of Linear Convolutional Networks
In this paper, we study linear convolutional networks with one-dimensional filters and arbitrary strides. The neuromanifold of such a network is a semialgebraic set, represented by a space of polynomials admitting specific factorizations. Introducing a recursive algorithm, we generate polynomial equations whose common zero locus corresponds to the Zariski closure of the corresponding neuromanifold. Furthermore, we explore the algebraic complexity of training these networks employing tools from metric algebraic geometry. Our findings reveal that the number of all complex critical points in the optimization of such a network is equal to the generic Euclidean distance degree of a Segre variety. Notably, this count significantly surpasses the number of critical points encountered in the training of a fully connected linear network with the same number of parameters.
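The link between one-dimensional linear convolutional networks and factorized polynomials can be illustrated in the simplest case. The sketch below assumes stride 1 and full (zero-padded) convolutions, which is narrower than the arbitrary strides treated in the abstract. The helper `end_to_end_filter` and the example filters are hypothetical.

```python
import numpy as np

def end_to_end_filter(filters):
    """Collapse a stack of stride-1 convolutional layers into one filter."""
    out = np.array([1.0])  # identity element for convolution
    for w in filters:
        out = np.convolve(out, w)
    return out

# Two illustrative layer filters and an input signal (not from the paper).
w1 = np.array([1.0, 2.0])
w2 = np.array([1.0, -1.0, 3.0])
x = np.array([0.5, -1.0, 2.0, 4.0])

# Applying the layers one after another ...
layered = np.convolve(np.convolve(x, w1), w2)
# ... matches a single convolution with the end-to-end filter.
collapsed = np.convolve(x, end_to_end_filter([w1, w2]))

# Convolving filters is the same as multiplying polynomials with those
# coefficients, so the end-to-end filter is a polynomial that factors.
product = np.polymul(w1, w2)
print(end_to_end_filter([w1, w2]), product)
```

In this restricted setting, the functions the network can represent are exactly the filters whose associated polynomial factors into polynomials of the layers' degrees, which gives a concrete picture of a space of polynomials "admitting specific factorizations".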
Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm
Jagadeesan, Meena, Razenshteyn, Ilya, Gunasekar, Suriya
We study the function space characterization of the inductive bias resulting from controlling the $\ell_2$ norm of the weights in linear convolutional networks. We view this in terms of an induced regularizer in the function space given by the minimum norm of weights required to realize a linear function. For two-layer linear convolutional networks with $C$ output channels and kernel size $K$, we show the following: (a) If the inputs to the network have a single channel, the induced regularizer for any $K$ is a norm given by a semidefinite program (SDP) that is independent of the number of output channels $C$. (b) In contrast, for networks with multi-channel inputs, multiple output channels can be necessary to merely realize all matrix-valued linear functions, and thus the inductive bias does depend on $C$. Further, for sufficiently large $C$, the induced regularizers for $K=1$ and $K=D$ are the nuclear norm and the $\ell_{2,1}$ group-sparse norm, respectively, of the Fourier coefficients, both of which promote sparse structures. We further validate these results through a binary classification task on MNIST.
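The two norms named in (b) can be computed directly. This is a rough sketch under stated assumptions: the linear function is stored as a $D \times R$ matrix `W` (spatial position by input channel), its Fourier coefficients are the unitary DFT along the spatial axis, and the $\ell_{2,1}$ groups are the rows, one per frequency. The abstract does not spell out the layout, normalization, or grouping, so these choices and all names below are hypothetical.

```python
import numpy as np

def fourier_coefficients(W):
    # Unitary DFT along the spatial axis (axis 0); columns are input channels.
    return np.fft.fft(W, axis=0, norm="ortho")

def nuclear_norm_fourier(W):
    # Sum of singular values of the Fourier coefficient matrix.
    return float(np.linalg.norm(fourier_coefficients(W), ord="nuc"))

def group_sparse_norm_fourier(W):
    # Assumption: one group per frequency (row), l2 norm taken across channels.
    W_hat = fourier_coefficients(W)
    return float(np.sum(np.linalg.norm(W_hat, axis=1)))

D = 8
# Illustrative rank-one example: a cosine profile shared by three channels.
W = np.outer(np.cos(2 * np.pi * np.arange(D) / D), np.array([1.0, -2.0, 0.5]))

frobenius = float(np.linalg.norm(W))
print(frobenius, nuclear_norm_fourier(W), group_sparse_norm_fourier(W))
```

For this rank-one example the nuclear norm equals the Frobenius norm, since there is a single nonzero singular value. The group-sparse norm is larger by a factor of $\sqrt{2}$ because the cosine occupies two frequencies. The first norm rewards low rank and the second rewards few active frequencies.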
Implicit Bias of Gradient Descent on Linear Convolutional Networks
Gunasekar, Suriya, Lee, Jason D., Soudry, Daniel, Srebro, Nati
We show that gradient descent on full-width linear convolutional networks of depth $L$ converges to a linear predictor related to the $\ell_{2/L}$ bridge penalty in the frequency domain. This is in contrast to linear fully connected networks, where gradient descent converges to the hard-margin linear SVM solution, regardless of depth. Papers published at the Neural Information Processing Systems Conference.
Implicit Bias of Gradient Descent on Linear Convolutional Networks
Gunasekar, Suriya, Lee, Jason D., Soudry, Daniel, Srebro, Nati
Large-scale neural networks used in practice are highly over-parameterized, with far more trainable model parameters than training examples. Consequently, optimization objectives for learning such high-capacity models have many global minima that fit the training data perfectly. However, minimizing the training loss using a specific optimization algorithm takes us not to just any global minimum, but to some special global minimum, e.g., a global minimum minimizing some regularizer R(β). In over-parameterized models, especially deep neural networks, much, if not most, of the inductive bias of the learned model comes from this implicit regularization from the optimization algorithm. Understanding the implicit bias, e.g., via characterizing R(β), is thus essential for understanding how and what the model learns.
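A standard textbook instance of such an implicit regularizer, which is not the setting analyzed in this paper, is under-determined least squares: gradient descent started at zero reaches the interpolating solution of minimum $\ell_2$ norm, so R(β) is the Euclidean norm. The sketch below checks this numerically. The problem sizes, random data, step size, and iteration count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # fewer examples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Plain gradient descent on 0.5 * ||X beta - y||^2, started at zero.
beta = np.zeros(d)
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
for _ in range(20000):
    beta -= step * X.T @ (X @ beta - y)

# Among all interpolating solutions, the pseudoinverse picks the one of minimum l2 norm.
beta_min_norm = np.linalg.pinv(X) @ y

print(np.linalg.norm(X @ beta - y))          # training residual
print(np.linalg.norm(beta - beta_min_norm))  # distance to the min-norm solution
```

Infinitely many vectors fit the data exactly here, yet the iterates never leave the row space of `X`, which is why the algorithm selects the minimum-norm interpolant without any explicit penalty.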