Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers
Neural Information Processing Systems
As a cost-effective alternative, learning-free PTQ schemes have been proposed. However, their performance is limited because they cannot account for the inter-layer dependencies within the attention module, a defining feature of Transformers. In this paper, we therefore propose a novel PTQ algorithm, called aespa, that balances accuracy and efficiency. The key idea is to perform quantization layer-wise for efficiency while targeting attention-wise reconstruction to capture the cross-layer dependency.
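To make the key idea concrete, here is a minimal NumPy sketch, not the paper's actual aespa implementation: each projection (query, key, value) is quantized one layer at a time, but candidate quantizers are scored by the reconstruction error of the full attention output rather than of the individual layer. The 4-bit symmetric quantizer, the scale search grid, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, wq, wk, wv):
    # single-head scaled dot-product attention (toy setting)
    q, k, v = x @ wq, x @ wk, x @ wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def quantize(w, scale):
    # 4-bit symmetric uniform quantizer (illustrative, not the paper's scheme)
    return np.clip(np.round(w / scale), -8, 7) * scale

d = 16
x = rng.standard_normal((8, d))
weights = {n: rng.standard_normal((d, d)) / np.sqrt(d) for n in ("wq", "wk", "wv")}
y_fp = attention(x, weights["wq"], weights["wk"], weights["wv"])  # full-precision target

quantized = {}
for name, w in weights.items():
    base = np.abs(w).max() / 7  # naive min-max scale for this layer
    best, best_err = None, np.inf
    # layer-wise update: only this projection is quantized while searching,
    # but the error is measured attention-wise (at the attention output)
    for mult in np.linspace(0.6, 1.2, 13):
        cand = dict(weights)
        cand[name] = quantize(w, base * mult)
        err = np.sum((attention(x, cand["wq"], cand["wk"], cand["wv"]) - y_fp) ** 2)
        if err < best_err:
            best, best_err = cand[name], err
    quantized[name] = best

y_q = attention(x, quantized["wq"], quantized["wk"], quantized["wv"])
print("attention-wise reconstruction error:", float(np.sum((y_q - y_fp) ** 2)))
```

The contrast with a purely layer-wise scheme is in the objective: a layer-wise method would pick the scale minimizing `||w - quantize(w)||`, while the sketch above scores each candidate by how well the whole attention output is reconstructed, which is what lets it account for cross-layer effects at layer-wise cost.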