GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking
Patrick Chen, Si Si, Yang Li, Ciprian Chelba, Cho-Jui Hsieh
Neural Information Processing Systems
Model compression is essential for serving large deep neural nets on devices with limited resources or in applications that require real-time responses. For advanced NLP problems, a neural language model usually consists of recurrent layers (e.g., using LSTM cells), an embedding matrix for representing input tokens, and a softmax layer for generating output tokens. For problems with a very large vocabulary, the embedding and softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with a vocabulary of around 800k words; its word embedding and softmax matrices use more than 6 GB of space and are responsible for over 90% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models, based on vocabulary-partition (block-wise) low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words).
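The idea can be illustrated with a minimal NumPy sketch: partition the embedding rows into blocks by token frequency and factor each block with a truncated SVD, giving frequent tokens a higher rank. The block count, rank schedule, and function names below are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch of block-wise low-rank embedding compression (assumed setup:
# a dense NumPy embedding matrix plus per-token frequency counts).
import numpy as np

def block_lowrank_compress(embedding, frequencies, num_blocks=4, base_rank=8):
    """Partition rows by token frequency and factor each block via truncated SVD.

    More frequent blocks receive a higher rank, reflecting the power-law
    distribution of word usage (rank schedule here is an assumption).
    """
    vocab_size, dim = embedding.shape
    # Sort tokens from most to least frequent, then split into contiguous blocks.
    order = np.argsort(-frequencies)
    blocks = np.array_split(order, num_blocks)

    factors = []
    for i, idx in enumerate(blocks):
        rank = max(1, base_rank * (num_blocks - i))  # frequent blocks keep more rank
        rank = min(rank, len(idx), dim)
        U, S, Vt = np.linalg.svd(embedding[idx], full_matrices=False)
        # Store the truncated factors instead of the dense block.
        factors.append((idx, U[:, :rank] * S[:rank], Vt[:rank]))
    return factors

def reconstruct(factors, vocab_size, dim):
    """Rebuild an approximate embedding matrix from the stored block factors."""
    approx = np.zeros((vocab_size, dim))
    for idx, A, B in factors:
        approx[idx] = A @ B
    return approx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.standard_normal((1000, 64))
    freq = rng.zipf(1.5, size=1000).astype(float)  # power-law-like counts
    factors = block_lowrank_compress(emb, freq)
    approx = reconstruct(factors, *emb.shape)
    print("relative error:", np.linalg.norm(emb - approx) / np.linalg.norm(emb))
```

The storage saving comes from keeping only the two thin factors per block; allocating rank by frequency spends most of the budget on the small set of tokens that dominate usage.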