DPKD 25+50 0.0001 0.841 1 2

We run our experiments using PyTorch's distributed training on an Azure ML NVIDIA DGX-2. In this section, we present all the hyper-parameters used for training our models. We fix the maximum gradient norm to 1 and set the batch size to 1024 in all experiments, following [75, 34]. Structured pruning can be performed by pruning attention heads, pruning encoder units, or pruning the embedding layer. This use of KD is quite different from ours.
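
As a concrete illustration of this setup, the sketch below shows a distributed PyTorch training loop with the global gradient norm clipped to 1 and an effective batch size of 1024. The toy model, dataset, optimizer, and learning rate are placeholders for illustration only, not the configuration used in our experiments.

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Launch with torchrun, e.g. `torchrun --nproc_per_node=<num_gpus> train.py`.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy stand-ins for the model and data; the real ones are not specified here.
model = DDP(nn.Linear(128, 2).cuda(local_rank), device_ids=[local_rank])
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 2, (10_000,)))

# Global batch size of 1024, split evenly across processes.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=1024 // dist.get_world_size(), sampler=sampler)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # learning rate is illustrative
loss_fn = nn.CrossEntropyLoss()

for inputs, targets in loader:
    inputs, targets = inputs.cuda(local_rank), targets.cuda(local_rank)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Clip the global gradient norm to 1, as stated above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

dist.destroy_process_group()
```

Similarly, structured pruning at the attention-head level can be expressed with the Hugging Face transformers `prune_heads` API; the model and the heads removed below are arbitrary examples, not a pruning schedule studied in this work.

```python
from transformers import BertModel

# Arbitrary example: remove heads 0 and 2 from layer 0 and head 5 from layer 3.
model = BertModel.from_pretrained("bert-base-uncased")
model.prune_heads({0: [0, 2], 3: [5]})
```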
