Bias of Stochastic Gradient Descent or the Architecture: Disentangling the Effects of Overparameterization of Neural Networks

Amit Peleg, Matthias Hein

arXiv.org Artificial Intelligence 

Neural networks typically generalize well when fitting the data perfectly, even though they are heavily overparameterized. Many factors have been pointed out as the reason for this phenomenon, including an implicit bias of stochastic gradient descent (SGD) and a possible simplicity bias arising from the neural network architecture. The goal of this paper is to disentangle the factors that influence generalization stemming from optimization and architectural choices by studying random and SGD-optimized networks that achieve zero training error. We experimentally show, in the low sample regime, that overparameterization in terms of increasing width is beneficial for generalization, and this benefit is due to the bias of SGD and not due to an architectural bias. In contrast, for increasing depth, overparameterization is detrimental for generalization, but random and SGD-optimized networks behave similarly, so this can be attributed to an architectural bias.

The implicit bias of Stochastic Gradient Descent (SGD) is thus often thought to be the main reason behind generalization (Arora et al., 2019; Shah et al., 2020). A recent thought-provoking study by Chiang et al. (2023) suggests the idea of the volume hypothesis for generalization: well-generalizing basins of the loss occupy a significantly larger volume in the weight space of neural networks than basins that do not generalize well. They argue that the generalization performance of neural networks is primarily a bias of the architecture and that the implicit bias of SGD is only a secondary effect. To this end, they randomly sample networks that achieve zero training error (which they term Guess and Check (G&C)) and argue that the generalization performance of these networks is qualitatively similar to networks found by SGD.

In this work, we revisit the approach of Chiang et al. (2023) and study it in detail to disentangle the effects of the implicit bias of SGD from a potential bias of the choice of architecture. As we have to compare to randomly sampled neural …
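For intuition, the following is a minimal sketch of the Guess-and-Check (G&C) sampling idea described above: repeatedly draw a network with random weights and keep only those that reach zero training error, which can then be compared against SGD-trained networks. The architecture, data shapes, and the use of PyTorch's default initialization as the sampling distribution are illustrative assumptions here, not the exact setup of Chiang et al. (2023) or of this paper.

```python
# Illustrative sketch of Guess and Check (G&C): sample random networks and
# keep those that fit the training set perfectly (zero training error).
# NOTE: architecture, label encoding, and sampling distribution (PyTorch's
# default init) are assumptions for illustration only.
import torch
import torch.nn as nn


def sample_zero_error_networks(x_train, y_train, n_samples=100_000, hidden=16):
    """Return randomly sampled networks that classify x_train perfectly."""
    accepted = []
    n_classes = int(y_train.max()) + 1
    for _ in range(n_samples):
        # "Guess": draw a fresh network with random weights.
        net = nn.Sequential(
            nn.Linear(x_train.shape[1], hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )
        # "Check": keep the network only if it reaches zero training error.
        with torch.no_grad():
            preds = net(x_train).argmax(dim=1)
        if (preds == y_train).all():
            accepted.append(net)
    return accepted
```

In practice the probability that a random network fits the training data exactly is extremely small, which is why such comparisons between randomly sampled and SGD-optimized networks are restricted to the low-sample regime mentioned in the abstract.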
