

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Neural Information Processing Systems

Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance, known as the "generalization gap" phenomenon. Identifying the origin of this gap and closing it has remained an open problem. Contributions: We examine the initial high-learning-rate training phase.
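As a rough illustration of the setup described above, the sketch below runs plain minibatch SGD on a toy least-squares problem; the data, model, learning rate, and batch_size are hypothetical and serve only to show how the gradient is estimated from a small fraction of the training data at each step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 1,000 examples, 20 features, linear targets with noise.
X = rng.normal(size=(1000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(20)      # model weights
lr = 0.1              # learning rate
batch_size = 32       # the generalization gap is reported when this is made large

for step in range(500):
    # Estimate the gradient from a random mini-batch of the training data.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)  # gradient of mean squared error
    w -= lr * grad                                    # SGD weight update

print("training MSE:", np.mean((X @ w - y) ** 2))
```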


Appendix A Proof of Theorem 2.1

Neural Information Processing Systems

We have the following lemma. Using the notation of Lemma A.1, we have ... The third inequality uses the Lipschitz assumption on the loss function.

Figure 10 supplements 'Relation to disagreement' at the end of Section 2. It shows an example where the behavior of inconsistency differs from that of disagreement. All experiments were run on GPUs (A100 or older). The goal of the experiments reported in Section 3.1 was to find whether/how the predictiveness of ... The arrows indicate the direction of training becoming longer.






On the Limitations of Fractal Dimension as a Measure of Generalization

Charlie B. Tan (University of Oxford), Inés García-Redondo (Imperial College London), Qiquan Wang

Neural Information Processing Systems

Bounding and predicting the generalization gap of overparameterized neural networks remains a central open problem in theoretical machine learning. There is a recent and growing body of literature that proposes the framework of fractals to model optimization trajectories of neural networks, motivating generalization bounds and measures based on the fractal dimension of the trajectory. Notably, the persistent homology dimension has been proposed to correlate with the generalization gap.
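As a loose illustration of assigning a fractal dimension to an optimization trajectory, the sketch below estimates a box-counting dimension for a recorded sequence of iterates. This is not the persistent homology dimension discussed above; the trajectory, the grid scales, and the helper name box_counting_dimension are hypothetical choices for the example.

```python
import numpy as np

def box_counting_dimension(points, scales):
    """Estimate the box-counting dimension of a point cloud (e.g. optimizer iterates).

    points : (T, d) array of trajectory points.
    scales : iterable of box side lengths to test.
    Returns the slope of log N(eps) versus log(1/eps).
    """
    points = np.asarray(points)
    # Normalize the trajectory into the unit cube so the scales are comparable.
    span = points.max(axis=0) - points.min(axis=0) + 1e-12
    unit = (points - points.min(axis=0)) / span

    log_inv_eps, log_counts = [], []
    for eps in scales:
        # Assign each point to a grid cell of side eps and count distinct cells.
        cells = np.floor(unit / eps).astype(np.int64)
        n_boxes = len(np.unique(cells, axis=0))
        log_inv_eps.append(np.log(1.0 / eps))
        log_counts.append(np.log(n_boxes))
    # Dimension estimate = slope of the log-log fit.
    slope, _ = np.polyfit(log_inv_eps, log_counts, deg=1)
    return slope

# Hypothetical "trajectory": a noisy random walk in 10 dimensions.
rng = np.random.default_rng(0)
trajectory = np.cumsum(rng.normal(scale=0.01, size=(5000, 10)), axis=0)
print("box-counting dimension estimate:",
      box_counting_dimension(trajectory, scales=[0.2, 0.1, 0.05, 0.025]))
```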


NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations

Marco Ciccone, Marco Gallieri, Jonathan Masci, Christian Osendorfer, Faustino Gomez

Neural Information Processing Systems

Each block represents a time-invariant iterative process, as the first layer in the i-th block, x_i(1), is unrolled into a pattern-dependent number, K_i, of processing stages, using weight matrices A_i and B_i. The skip connections from the input, u_i, to all layers in block i make the process non-autonomous. Blocks can be chained together (each block modeling a different latent space) by passing the final latent representation, x_i(K_i), of block i as the input to block i+1.
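A minimal sketch of this unrolling, assuming a residual-style update x_i(k+1) = x_i(k) + tanh(A_i x_i(k) + B_i u_i + b_i) and a fixed number of stages per block; the exact update rule, the stability constraints on A_i, and the criterion that makes K_i pattern-dependent are specified in the paper and are not reproduced here. The initialization of x_i(1) below is likewise a hypothetical choice.

```python
import numpy as np

def nais_block(u, A, B, b, num_stages):
    """Unroll one non-autonomous block: the input u is re-injected at every stage."""
    x = np.tanh(B @ u + b)                 # x_i(1), hypothetical initialization
    for _ in range(num_stages - 1):
        # The skip connection from u makes the otherwise time-invariant process non-autonomous.
        x = x + np.tanh(A @ x + B @ u + b)
    return x                               # x_i(K_i): final latent representation of the block

rng = np.random.default_rng(0)
dim_in, dim_latent = 8, 16
u = rng.normal(size=dim_in)

# Chain two blocks: the output of block i is the input to block i+1.
A1 = 0.1 * rng.normal(size=(dim_latent, dim_latent))
B1 = rng.normal(size=(dim_latent, dim_in))
x1 = nais_block(u, A1, B1, np.zeros(dim_latent), num_stages=5)

A2 = 0.1 * rng.normal(size=(dim_latent, dim_latent))
B2 = rng.normal(size=(dim_latent, dim_latent))
x2 = nais_block(x1, A2, B2, np.zeros(dim_latent), num_stages=5)
print("final representation shape:", x2.shape)
```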