Appendix
Neural Information Processing Systems
The form of Equation (A.8) allows us to apply the chain rule to calculate the gradient of the normalized ... Again, the chain rule is applied for the derivative of the weight matrix. Based on the gradient, one step of optimization under learning rate α can be expressed in a neat matrix multiplication format, decomposed by the orthonormal bases U = {u1, u2, ...} and V = {v1, v2, ...}. The whole pruning framework is detailed in Algorithm 1. The grow fraction α is a function of training iterations that gradually decays for stability of training. ImageNet experiments are run on 8 NVIDIA Tesla V100s. Accordingly, the schedule of AC/DC needs slight modifications based on the original setting.
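As a rough illustration of a grow fraction that decays over training iterations, the sketch below uses a cosine decay. The excerpt does not state which decay the authors use, so the cosine shape, the function name `grow_fraction`, and the initial value `alpha0 = 0.3` are all assumptions made for illustration only.

```python
import math


def grow_fraction(step: int, total_steps: int, alpha0: float = 0.3) -> float:
    """Hypothetical grow-fraction schedule (assumed cosine decay).

    Returns alpha0 at step 0 and decays smoothly to 0 at total_steps.
    Neither alpha0 nor the cosine shape comes from the source text.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so values outside [0, total_steps] stay well defined.
    t = min(max(step, 0), total_steps)
    return 0.5 * alpha0 * (1.0 + math.cos(math.pi * t / total_steps))


if __name__ == "__main__":
    # Print the schedule at a few points of a 100-step run.
    for s in (0, 25, 50, 75, 100):
        print(s, round(grow_fraction(s, 100), 4))
```

A schedule of this kind regrows many connections early in training, when the sparse topology is still being explored, and very few near the end, which is one plausible way to get the stability the text refers to.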
Feb-7-2026, 07:47:07 GMT