Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models
Neural Information Processing Systems
The Transformer architecture has become a fundamental building block of widespread natural language processing (NLP) models. As NLP models grow larger, their increasing memory and computation costs hinder efficient deployment on resource-limited devices, so transformer quantization has attracted wide research interest. Recent work recognizes that structured outliers are the critical bottleneck for quantization performance; however, the proposed remedies add computation overhead while still leaving the outliers in place.
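To make the outlier bottleneck concrete, here is a minimal NumPy sketch, not the paper's method, with illustrative values: a single large-magnitude channel stretches the per-tensor quantization scale, which coarsens the grid for all the well-behaved values and inflates their quantization error.

```python
# Minimal sketch (hypothetical values): one structured outlier channel
# stretches the per-tensor scale and degrades precision everywhere else.
import numpy as np

def quantize_dequantize(x, num_bits=8):
    """Symmetric uniform per-tensor quantization followed by dequantization."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for 8-bit
    scale = np.abs(x).max() / qmax          # an outlier inflates this scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
activations = rng.normal(0.0, 1.0, size=(4, 256))  # well-behaved activations
activations[:, 0] *= 60.0                          # one structured outlier channel

dense = activations[:, 1:]                         # the non-outlier channels
err_with_outlier = np.mean((quantize_dequantize(activations)[:, 1:] - dense) ** 2)
err_without = np.mean((quantize_dequantize(dense) - dense) ** 2)
print(f"MSE on normal channels, outlier present: {err_with_outlier:.6f}")
print(f"MSE on normal channels, outlier removed: {err_without:.6f}")
```

Running this shows the mean squared error on the normal channels growing by orders of magnitude when the outlier channel is present, which is why suppressing such outliers, rather than working around them, is the focus of the paper.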