linear neural network
Implicit Regularization of Discrete Gradient Dynamics in Linear Neural Networks
When optimizing over-parameterized models such as deep neural networks, a large set of parameters can achieve zero training error. In such cases, the choice of optimization algorithm and its hyper-parameters introduces biases that lead to convergence to specific minimizers of the objective. Consequently, this choice can be viewed as an implicit regularization for the training of over-parameterized models. In this work, we push this idea further by studying the discrete gradient dynamics of the training of a two-layer linear network with the least-squares loss. Using a time rescaling, we show that, with a vanishing initialization and a small enough step size, these dynamics sequentially learn the solutions of a reduced-rank regression with a gradually increasing rank.
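The incremental behavior described in this abstract can be observed in a few lines of code. The following is a minimal sketch, not the authors' implementation, assuming Gaussian inputs, a planted rank-3 teacher, a tiny balanced random initialization, and plain full-batch gradient descent; the singular values of the end-to-end matrix W2 W1 rise roughly one at a time, tracing out reduced-rank regression solutions of increasing rank.

```python
# Minimal sketch (assumptions: Gaussian inputs, planted rank-3 teacher,
# vanishing random initialization, plain full-batch gradient descent).
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 10, 500                     # input/output dim, hidden width, samples
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
W_star = U @ np.diag([5.0, 2.0, 0.8] + [0.0] * (d - 3)) @ V.T   # rank-3 teacher
X = rng.standard_normal((n, d))
Y = X @ W_star.T

scale, lr = 1e-4, 1e-2                    # vanishing initialization, small step size
W1 = scale * rng.standard_normal((k, d))
W2 = scale * rng.standard_normal((d, k))

for step in range(3001):
    if step % 300 == 0:
        sv = np.linalg.svd(W2 @ W1, compute_uv=False)[:4]
        print(step, np.round(sv, 3))      # singular values switch on roughly one at a time
    R = X @ (W2 @ W1).T - Y               # residuals, shape (n, d)
    G = R.T @ X / n                       # gradient of 1/(2n)||X(W2 W1)^T - Y||_F^2 w.r.t. W2 W1
    W1 -= lr * (W2.T @ G)                 # chain rule through the two layers
    W2 -= lr * (G @ W1.T)
```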
Representation Costs of Linear Neural Networks: Analysis and Design
For different parameterizations (mappings from parameters to predictors), we study the regularization cost in predictor space induced by $l_2$ regularization on the parameters (weights). We focus on linear neural networks as parameterizations of linear predictors. We identify the representation cost of certain sparse linear ConvNets and residual networks. In order to get a better understanding of how the architecture and parameterization affect the representation cost, we also study the reverse problem, identifying which regularizers on linear predictors (e.g., $l_p$ norms, group norms, the $k$-support-norm, elastic net) can be the representation cost induced by simple $l_2$ regularization, and designing the parameterizations that do so.
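As a standard worked example of this correspondence (the textbook case, not a result taken from the abstract above): for a two-layer "diagonal" linear network that parameterizes a linear predictor $\beta \in \mathbb{R}^d$ as $\beta = u \odot v$, the representation cost induced by $l_2$ regularization of the parameters is exactly the $l_1$ norm,

$$ \min_{u \odot v = \beta} \; \frac{1}{2}\left(\|u\|_2^2 + \|v\|_2^2\right) \;=\; \sum_{i=1}^d \min_{u_i v_i = \beta_i} \frac{u_i^2 + v_i^2}{2} \;=\; \sum_{i=1}^d |\beta_i| \;=\; \|\beta\|_1, $$

where each inner minimum follows from the AM-GM inequality, with equality at $|u_i| = |v_i| = \sqrt{|\beta_i|}$. The abstract above asks which other regularizers (e.g., $l_p$ norms, group norms, the $k$-support-norm, elastic net) arise in the same way from richer parameterizations.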
Faster Directional Convergence of Linear Neural Networks under Spherically Symmetric Data
In this paper, we study gradient methods for training deep linear neural networks with the binary cross-entropy loss. In particular, we improve the global directional convergence guarantees from a polynomial rate to a linear rate for (deep) linear networks under a spherically symmetric data distribution, which can be viewed as a specific zero-margin dataset. Our results do not require assumptions made in other works, such as a small initial loss, presumed convergence of the weight direction, or overparameterization. We also illustrate our findings with experiments.
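As a hedged illustration of what directional convergence means here (not the paper's spherically symmetric zero-margin setting, and not its experiments): the sketch below trains a two-layer linear network with the logistic (binary cross-entropy) loss on linearly separable Gaussian data; the norm of the end-to-end predictor keeps growing while its direction stabilizes, which is the quantity the convergence rates refer to.

```python
# Minimal sketch (assumptions: separable Gaussian data rather than the paper's
# spherically symmetric zero-margin setting; depth-2 linear network).
import numpy as np
from scipy.special import expit           # numerically stable sigmoid

rng = np.random.default_rng(1)
n, d = 200, 5
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)                   # labels in {-1, +1}

W = 0.1 * rng.standard_normal((d, d))     # first layer
v = 0.1 * rng.standard_normal(d)          # second layer (outputs the logit)
lr, prev_dir = 0.1, None

for step in range(50001):
    w = W.T @ v                           # end-to-end linear predictor
    if step % 10000 == 0:
        direction = w / np.linalg.norm(w)
        if prev_dir is not None:
            # norm keeps growing while the direction stabilizes across checkpoints
            print(step, round(np.linalg.norm(w), 2), round(float(direction @ prev_dir), 6))
        prev_dir = direction
    margins = y * (X @ w)
    c = -y * expit(-margins)              # per-sample derivative of log(1+exp(-margin)) w.r.t. the logit
    g = X.T @ c / n                       # gradient of the mean logistic loss w.r.t. w
    W, v = W - lr * np.outer(v, g), v - lr * (W @ g)   # simultaneous layer updates via the chain rule
```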