Goto

Collaborating Authors

 cifar100



Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training

Neural Information Processing Systems

Regularization in modern machine learning is crucial, and it can take various forms in algorithmic design: training set, model family, error function, regularization terms, and optimizations. In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training. Indeed, many widely adopted training strategies basically just define the decay of the learning rate over time. This process can be interpreted as decreasing a temperature, using either a global learning rate (for the entire model) or a learning rate that varies for each parameter. This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method. TempBalanceis based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which characterizes the implicit self-regularization of different layers in trained models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during model training, resulting in improved performance during testing.





equizero_neurips23_format

Neural Information Processing Systems

Proof of Thm. 2. We want to show M G(hx)= hM G(x) for all x 2X and h 2 G. From the definition of M G in equation 4, we have M G(hx)= 1P Similar to Yarotsky (2022), we first define Ksym = S g2G gK. Note that Ksym is also a compact set and Ksym X . We want to show that M G,equi(gx)= gM G,equi(x). Hence, ( h(gx) 1gx) is invariant to actions of G. The proof for invariance of M G,inv(x) follows similarly. In addition to properties discussed in section 3.3, here we show that equizero models have autoregressive and invertibility properties. These properties have not been used in the main paper, but we believe they could be of use for future work in this area.


Diffused Redundancy

Neural Information Processing Systems

A.1 CKADefinition In all our evaluations we use CKA with a linear kernel [24] which essentially amounts to the following steps: A.2 Additional CKA results Fig 9 shows CKA comparison between randomly chosen parts of the layer and the full layer for different kinds of ResNet50. We observe that even ResNet50 trained with MRL loss shows a significant amount of diffused redundancy. Figure 9: [Comparison of Diffused Redundancy in MRL vs other losses, through the lens of CKA] We see a similar trend as reported in Fig 7 in the main paper, where even the MRL model shows a significant amount of diffused redundancy despite being explicitly trained to instead have structured redundancy. The amount of diffused redundancy however is much lesser than the resnets trained using the standard loss and adv. Here we list the sources of weights for the various pre-trained models used in our experiments: ResNet18 trained on ImageNet1k using standard loss: taken from timmv0.6.1.