Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Neural Information Processing Systems 

As a cost-effective alternative, learning-free PTQ schemes have been proposed. However, their performance is limited because they cannot account for the inter-layer dependency within the attention module, which is a defining feature of Transformers. In this paper, we thus propose a novel PTQ algorithm that balances accuracy and efficiency. The key idea of the proposed algorithm, called aespa, is to perform quantization layer-wise for efficiency while targeting attention-wise reconstruction to capture the cross-layer dependency.
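To make the distinction concrete, below is a minimal sketch of the idea of quantizing projection weights independently (layer-wise) while measuring reconstruction error on the attention output rather than on each layer's own output. The toy single-head attention, the round-to-nearest quantizer, and all variable names are illustrative assumptions; they are not aespa's actual quantizer or objective, which are not detailed in this abstract.

```python
import torch

def quantize_weight(w, n_bits=4):
    # Simple per-tensor uniform round-to-nearest quantization (a stand-in quantizer).
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def attention(x, wq, wk, wv):
    # Toy single-head scaled dot-product attention (no output projection).
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

torch.manual_seed(0)
d = 64
x = torch.randn(8, 16, d)                                  # calibration batch (B, T, d)
wq, wk, wv = (torch.randn(d, d) / d ** 0.5 for _ in range(3))

full_out = attention(x, wq, wk, wv)

# Layer-wise quantization: each projection is quantized independently ...
wq_q, wk_q, wv_q = map(quantize_weight, (wq, wk, wv))

# ... but the reconstruction target is the attention output (attention-wise),
# not each projection's own output (layer-wise).
attn_wise_err = (attention(x, wq_q, wk_q, wv_q) - full_out).pow(2).mean()
layer_wise_err = sum(((x @ w_q) - (x @ w)).pow(2).mean()
                     for w_q, w in ((wq_q, wq), (wk_q, wk), (wv_q, wv)))

print(f"attention-wise reconstruction error: {attn_wise_err:.6f}")
print(f"sum of per-layer output errors:      {layer_wise_err:.6f}")
```

The point of the sketch is only that the two error measures differ: minimizing per-layer output error ignores how quantization noise in the query, key, and value projections interacts through the softmax, whereas an attention-wise target accounts for that interaction.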
