A Appendices
Neural Information Processing Systems
A.1 Guarantees on the decrease of the training loss

As the scores are updated, the relative order of the importances is likely to change, and some connections will be replaced by more important ones. Under certain conditions, we can formally prove that the training loss is guaranteed to decrease as these replacements happen. Our proof is adapted from [Ramanujan et al., 2020] to cover the case of a fine-tunable W. We suppose that (a) the training loss L is smooth and admits a first-order Taylor expansion everywhere it is defined, and (b) the learning rate of W (α_W) is small compared to the learning rate of the scores S.

We first consider the case where k = 1 in the TopK masking, meaning that a single connection remains active (all other weights are deactivated/masked). A first-order Taylor expansion of the loss around the current iterate decomposes the change in loss into two terms. The first term is null because of inequalities (6), and the second term is negative because of inequality (7). We note that this proof is not specific to the TopK masking function.
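The TopK masking referred to above keeps only the k connections with the largest importance scores and zeroes out the rest. A minimal NumPy sketch, with illustrative weight and score values (not from the paper), showing the k = 1 case where a single connection survives:

```python
import numpy as np

def topk_mask(scores, k):
    """Binary mask keeping the k connections with the largest importance scores."""
    flat = scores.ravel()
    # indices of the k largest scores (argpartition avoids a full sort)
    idx = np.argpartition(flat, -k)[-k:]
    mask = np.zeros_like(flat)
    mask[idx] = 1.0
    return mask.reshape(scores.shape)

# Hypothetical weights W and importance scores S for a 2x2 layer
W = np.array([[0.5, -1.2],
              [0.3,  0.8]])
S = np.array([[2.0,  0.1],
              [0.4,  1.5]])

# k = 1: only the connection with the highest score stays active
masked_W = W * topk_mask(S, k=1)
```

As the scores S are updated during training, the argmax can shift to a different connection, which is exactly the "replacement" event the proof analyzes.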