Appendix: Learning Compact Representations of Neural Networks using DiscriminAtive Masking (DAM)

A Analysis of the DAM Gate Function Dynamics During Training

Neural Information Processing Systems 

In this section, we theoretically analyze the dynamics of the DAM mask $g_i$ at the $i$-th layer as the training process unfolds. The loss function for training the neural network for the target task can be denoted as $L = L(f(x, \Theta, \beta_i))$ (e.g., cross-entropy loss for supervised structured pruning problems and reconstruction error for representation learning problems), where $x$ denotes the input features to the neural network. Using gradient descent methods with a learning rate of $\eta$, the expected update formula of $\beta_i$ in DAM is given by:

$$\Delta\beta_i = -\eta\, \mathbb{E}_{x \sim D_{tr}}\left[\nabla_{\beta_i} L(f(x, \Theta, \beta_i)) + \lambda \nabla_{\beta_i} \beta_i / (l-1)\right] \tag{2}$$

$$= -\eta\, \mathbb{E}_{x \sim D_{tr}}\left[\nabla_{\beta_i} L(f(x, \Theta, \beta_i))\right] - \eta\lambda/(l-1) \tag{3}$$

Let $h_i$ be the layer output before applying the DAM mask, and let the masked output be represented as $o_i = h_i \odot g_i$ after applying the gate. For the $j$-th neuron, $\partial g_{ij}/\partial \beta_i = 0$ if and only if $\partial \xi_j(\beta_i)/\partial \beta_i = 0$. Since $\tanh(z)$ has non-zero gradients for $z > 0$, the gradient of $\xi_j(\beta_i)$ is 0 only when $kj/n_i + \beta_i \le 0$, i.e., the mask value of the neuron is 0 (in other words, the neuron is deactivated or dead). Let us denote the set of all neuron indices with non-zero mask values (also referred to as active neurons) as $J$. Equation 4 can then be simplified as:

$$\nabla_{\beta_i} L(f(x, \Theta, \beta_i)) = \alpha_i \sum_{j \in J} \cdots$$

We can make the following two observations: (i) only those neurons that are active (i.e., have non-zero mask values) contribute towards updating $\beta_i$ and moving the gate function. We call these neurons support neurons, and their position in the ordering of neurons the transitioning zone of the gate function.
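To make the role of the active set concrete, the following is a minimal Python sketch of one $\beta_i$ update in the spirit of Equation 3. It rests on several assumptions that are not stated in this section: the gate is taken to be $g_{ij} = \max(0, \tanh(\alpha_i (kj/n_i + \beta_i)))$ with $\alpha_i > 0$, the function names are hypothetical, and `upstream` is a made-up stand-in for the batch-averaged gradient of the loss with respect to each mask value. It is an illustration only, not the authors' implementation.

```python
import math

def dam_gate(beta, alpha, k, n):
    """Assumed gate form: g_j = max(0, tanh(alpha * (k*j/n + beta))), j = 1..n."""
    return [max(0.0, math.tanh(alpha * (k * j / n + beta))) for j in range(1, n + 1)]

def dam_gate_grad(beta, alpha, k, n):
    """dg_j/dbeta: alpha * (1 - tanh(z)^2) when k*j/n + beta > 0, else 0 (dead neuron)."""
    grads = []
    for j in range(1, n + 1):
        shifted = k * j / n + beta
        if shifted > 0:
            grads.append(alpha * (1.0 - math.tanh(alpha * shifted) ** 2))
        else:
            grads.append(0.0)
    return grads

def beta_step(beta, upstream, alpha, k, n, lr, lam, num_layers):
    """One gradient-descent step on beta, following the form of Equation 3.

    upstream[j-1] is a hypothetical batch-averaged dL/dg_j (assumption).
    """
    dg = dam_gate_grad(beta, alpha, k, n)
    # Dead neurons have dg_j/dbeta = 0, so only the active set J contributes.
    grad_loss = sum(u * d for u, d in zip(upstream, dg))
    return beta - lr * grad_loss - lr * lam / (num_layers - 1)

# Toy layer: n = 5 neurons, k = 5, so k*j/n = j; beta = -2.5 leaves j = 3, 4, 5 active.
gates = dam_gate(beta=-2.5, alpha=1.0, k=5, n=5)
active = [j for j, g in enumerate(gates, start=1) if g > 0]

# Gradients arriving at dead neurons (j = 1, 2) do not change the update.
step_a = beta_step(-2.5, [9.0, -7.0, 0.3, 0.2, 0.1], 1.0, 5, 5, lr=0.1, lam=1.0, num_layers=3)
step_b = beta_step(-2.5, [0.0, 0.0, 0.3, 0.2, 0.1], 1.0, 5, 5, lr=0.1, lam=1.0, num_layers=3)
```

In this toy setting `step_a` and `step_b` coincide, which mirrors observation (i): whatever gradient reaches a dead neuron is multiplied by a zero gate derivative. When the loss gradient vanishes entirely, the step reduces to the constant regularization shift $-\eta\lambda/(l-1)$.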
