Appendix A Proof of Theorem 2.1

Neural Information Processing Systems

We have the following lemma. Using the notation of Lemma A.1, we have E[…]; the third inequality uses the Lipschitz assumption on the loss function.

Figure 10 supplements 'Relation to disagreement' at the end of Section 2. It shows an example where the behavior of inconsistency differs from that of disagreement.

All experiments were run on GPUs (A100 or older). The goal of the experiments reported in Section 3.1 was to find whether and how the predictiveness of inconsistency changes as training becomes longer. The arrows indicate the direction of training becoming longer.
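The contrast between inconsistency and disagreement can be made concrete with a small sketch. Here, disagreement is taken as the fraction of inputs on which two models predict different labels, and inconsistency is taken, as one possible instantiation (the paper's exact definition may differ), as the mean squared distance between the two models' predicted class distributions; the function names and array-based setup are illustrative, not the paper's implementation.

```python
import numpy as np

def disagreement(probs_a, probs_b):
    """Fraction of inputs on which two models predict different labels."""
    return np.mean(np.argmax(probs_a, axis=1) != np.argmax(probs_b, axis=1))

def inconsistency(probs_a, probs_b):
    """Mean squared distance between predicted class distributions
    (one possible instantiation of an inconsistency measure)."""
    return np.mean(np.sum((probs_a - probs_b) ** 2, axis=1))

# Two models that always agree on the predicted label (disagreement = 0)
# can still assign different probabilities (inconsistency > 0), so the two
# quantities can behave differently, as in the Figure 10 example.
pa = np.array([[0.9, 0.1], [0.6, 0.4]])
pb = np.array([[0.6, 0.4], [0.9, 0.1]])
```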









This pruning algorithm then assigns an importance score ‖dR/dw_l · w_l‖ to each weight and removes the weights receiving the lowest such scores. In Figure 8, we plot the generalization of the family of models that each aforementioned algorithm generates as a function of sparsity and training time in epochs. In Section 1, we show that the augmented training algorithm produces VGG-16 models with generalization that is indistinguishable from that of models produced by pruning with learning rate rewinding. We refer to the top K% of training examples whose training loss improves the most during pruning as the top-improved examples. To examine the influence of these top-improved examples on generalization, for each sparsity that pruning reaches, we train two dense models on two datasets respectively: a) the original training dataset excluding the top-improved examples at the specified sparsity, which we denote as TIE (Top-Improved Examples); b).
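The scoring-and-removal step can be sketched as follows, assuming the per-weight score reduces to the elementwise magnitude |dR/dw · w|; the function name, the flat-array setup, and the tie-breaking via `argsort` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prune_lowest(weights, grads, frac):
    """Zero out the fraction `frac` of weights with the lowest
    importance scores, where score = |dR/dw * w| elementwise.
    Returns a pruned copy; the input array is left untouched."""
    scores = np.abs(grads * weights)
    flat = weights.flatten()            # flatten() returns a copy
    k = int(frac * flat.size)
    idx = np.argsort(scores.ravel())[:k]  # indices of the k lowest scores
    flat[idx] = 0.0
    return flat.reshape(weights.shape)
```

With uniform gradients this reduces to magnitude pruning: the smallest weights receive the lowest scores and are removed first.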