GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking
Patrick Chen, Si Si, Yang Li, Ciprian Chelba, Cho-Jui Hsieh
Neural Information Processing Systems
Model compression is essential for serving large deep neural nets on devices with limited resources or in applications that require real-time responses. For advanced NLP problems, a neural language model usually consists of recurrent layers (e.g., using LSTM cells), an embedding matrix for representing input tokens, and a softmax layer for generating output tokens. For problems with a very large vocabulary, the embedding and softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with a vocabulary of around 800k words; its word embedding and softmax matrices use more than 6 GB of space and are responsible for over 90% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models, based on vocabulary-partition (block-wise) low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words).
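The idea can be illustrated with a minimal NumPy sketch: partition the embedding rows into blocks by token frequency and factor each block with a truncated SVD, giving frequent tokens a higher rank. The block count, rank schedule, and function names below are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch of block-wise low-rank embedding compression (assumed setup:
# a dense NumPy embedding matrix plus per-token frequency counts).
import numpy as np

def block_lowrank_compress(embedding, frequencies, num_blocks=4, base_rank=8):
    """Partition rows by token frequency and factor each block via truncated SVD.

    More frequent blocks receive a higher rank, reflecting the power-law
    distribution of word usage (rank schedule here is an assumption).
    """
    vocab_size, dim = embedding.shape
    # Sort tokens from most to least frequent, then split into contiguous blocks.
    order = np.argsort(-frequencies)
    blocks = np.array_split(order, num_blocks)

    factors = []
    for i, idx in enumerate(blocks):
        rank = max(1, base_rank * (num_blocks - i))  # frequent blocks keep more rank
        rank = min(rank, len(idx), dim)
        U, S, Vt = np.linalg.svd(embedding[idx], full_matrices=False)
        # Store the truncated factors instead of the dense block.
        factors.append((idx, U[:, :rank] * S[:rank], Vt[:rank]))
    return factors

def reconstruct(factors, vocab_size, dim):
    """Rebuild an approximate embedding matrix from the stored block factors."""
    approx = np.zeros((vocab_size, dim))
    for idx, A, B in factors:
        approx[idx] = A @ B
    return approx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.standard_normal((1000, 64))
    freq = rng.zipf(1.5, size=1000).astype(float)  # power-law-like counts
    factors = block_lowrank_compress(emb, freq)
    approx = reconstruct(factors, *emb.shape)
    print("relative error:", np.linalg.norm(emb - approx) / np.linalg.norm(emb))
```

The storage saving comes from keeping only the two thin factors per block; allocating rank by frequency spends most of the budget on the small set of tokens that dominate usage.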