Supplementary Materials A Derivation for gradients

Neural Information Processing Systems 

For VGG, the pooling layers are replaced with convolutional layers that have a stride of 2, and the dropout is applied after fully connected (FC) layers. We use the Pytorch library to accelerate training with multi-GPU machines. We train all teacher ANNs for 200 epochs using an SGD optimizer with a momentum of 0.9 and weight decay of 5e