Magnificent Minified Models
Harang, Rich, Sanders, Hillary
arXiv.org Artificial Intelligence
There are many ways to make a deep neural network smaller. In this paper, we focus on three categories of model size reduction: pruning, quantization, and training smaller models from scratch. Quantization converts model parameters to lower-precision formats, for example casting all 32-bit floating point parameters to 16-bit, which roughly halves file size. Pruning deletes parameters or groups of parameters (such as entire neurons) from a trained model to make it smaller, often followed by a fine-tuning round of training, as done in our experiments. Parameter-level pruning (also called unstructured pruning) removes individual parameters one at a time, whereas neuron-level pruning (also called structured pruning) removes all parameters associated with a given neuron at once. To simplify terminology across multiple methods, we use the term 'damage' to refer broadly to the undesired impact on network performance of removing a node or zeroing a weight. Different compression methods either estimate damage directly, or rank neurons or weights in order of increasing assumed damage according to some other metric that does not directly evaluate the impact on loss or performance.
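The three reduction styles above can be illustrated with a minimal NumPy sketch on a toy weight matrix. The magnitude threshold and L2-norm ranking below are common illustrative proxies for damage, not the specific criteria evaluated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix for one dense layer: 4 inputs -> 3 neurons.
W = rng.normal(size=(4, 3)).astype(np.float32)

# --- Quantization: cast 32-bit floats to 16-bit, halving storage. ---
W_fp16 = W.astype(np.float16)
assert W_fp16.nbytes == W.nbytes // 2

# --- Unstructured (parameter-level) pruning: zero individual weights
# with the smallest magnitude, here the bottom half. ---
threshold = np.quantile(np.abs(W), 0.5)
W_unstructured = np.where(np.abs(W) < threshold, 0.0, W)

# --- Structured (neuron-level) pruning: rank neurons (columns) by a
# cheap damage proxy -- the L2 norm of their incoming weights -- and
# remove the weakest neuron entirely, shrinking the layer. ---
neuron_norms = np.linalg.norm(W, axis=0)
keep = np.sort(np.argsort(neuron_norms)[1:])  # drop the lowest-norm neuron
W_structured = W[:, keep]
```

Note the structural difference: unstructured pruning leaves the matrix shape intact (relying on sparse storage for savings), while structured pruning yields a genuinely smaller dense matrix.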
Jun-16-2023