Gradient Descent
Supplementary Material
It is worth noting that, Eq. [...]
In Section 4.1, we showed the experimental results of HPM on two population synthetic functions, [...]
It is worth noting that, since the synthetic function only simulates the validation loss function (i.e., [...]
The same exploit strategy as in PBT, i.e., truncation selection [...]
All the code for the synthetic functions was implemented with Autograd.
As in Figure 1 of Section 4.1, we show the mean performance [...]
We show the details of the hyperparameters we tuned on the benchmark datasets as follows.
Tied weights are used for the embedding and softmax layers.
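The truncation selection exploit strategy mentioned above can be illustrated with a minimal sketch. It assumes the usual PBT convention that the worst-performing fraction of the population is overwritten with copies of members drawn from the best-performing fraction. The 20% fraction, the function name, and the dictionary keys (`loss`, `weights`, `hparams`) are hypothetical choices for illustration and are not taken from the text.

```python
import copy
import random


def truncation_selection(population, frac=0.2, rng=None):
    """Exploit step: overwrite the worst `frac` of agents with copies of the best.

    `population` is a list of dicts with keys "loss" (lower is better),
    "weights" and "hparams". These names are illustrative assumptions.
    The list is modified in place and also returned.
    """
    rng = rng or random.Random()
    # Rank agents from best (lowest loss) to worst (highest loss).
    ranked = sorted(population, key=lambda agent: agent["loss"])
    n_cut = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:n_cut], ranked[-n_cut:]
    for loser in bottom:
        # Each bottom agent copies a uniformly sampled top agent.
        winner = rng.choice(top)
        loser["weights"] = copy.deepcopy(winner["weights"])
        loser["hparams"] = copy.deepcopy(winner["hparams"])
        # The stored loss is left untouched here; it would normally be
        # re-evaluated after the next round of training.
    return population
```

In this sketch the copied hyperparameters would then be handed to whatever explore step the method uses; that step is outside the scope of the example.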
A used and training procedures
All the models are trained for 200 epochs with stochastic gradient descent, using a batch size of 128, a momentum of 0.9, and cosine [...] All the hyperparameters were selected with a small grid search.
From epoch 150 to epoch 185, the training error of the chunks of size 128/256 decreases below 0.5%, while for smaller chunk sizes it remains above 5%. Random chunks of size larger than 128/256 can fit the training set and thus have the same representational power as the whole network on the training data. For W > 128/256, the test accuracy decays approximately with the same law as that of independent networks of the same width (see Figure 1). This picture suggests that for CIFAR100 the size of a clone is 128/256, slightly larger than the size of the clones in CIFAR10.
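As a rough illustration of the training setup described above (200 epochs, SGD with momentum 0.9), the sketch below shows one common way to combine a momentum update with a cosine learning-rate schedule. The sentence mentioning "cosine" is cut off in the text, so reading it as cosine annealing of the learning rate is an assumption. The values `base_lr=0.1` and `min_lr=0.0` and both function names are hypothetical and are not given in the source.

```python
import math


def cosine_lr(epoch, total_epochs=200, base_lr=0.1, min_lr=0.0):
    """Cosine-annealed learning rate at a 0-indexed epoch.

    base_lr and min_lr are placeholder values (assumptions); the text
    only states the number of epochs.
    """
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def sgd_momentum_step(params, grads, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update on flat lists of floats, in place."""
    for i, g in enumerate(grads):
        # Accumulate an exponentially decaying sum of past gradients.
        velocity[i] = momentum * velocity[i] + g
        # Move each parameter against the accumulated direction.
        params[i] -= lr * velocity[i]
    return params, velocity
```

Under these assumptions the learning rate starts at `base_lr`, reaches half of it at epoch 100, and decays to `min_lr` at epoch 200.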