QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models
Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel
arXiv.org Artificial Intelligence
Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution.

Focusing specifically on generative inference, where the size of the weights is the main bottleneck, the currently best-performing method is GPTQ (Frantar et al., 2022), which achieves near-lossless quantization to 4-bit weights, and can even accurately support 2- and 3-bit weights by reducing the granularity to smaller weight groups, e.g., by jointly quantizing blocks of 64 weights using a shared scale and zero-point.
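To make the grouped scale-and-zero-point scheme concrete, below is a minimal NumPy sketch of plain round-to-nearest quantization over blocks of 64 weights. This is an illustration only: GPTQ's actual procedure corrects quantization error weight-by-weight rather than rounding independently, and QIGen's generated kernels are C++ code, not Python. The function names `quantize_groups` and `dequantize_groups` are hypothetical.

```python
import numpy as np

def quantize_groups(weights, bits=4, group_size=64):
    """Round-to-nearest quantization with a shared scale and
    zero-point per group of `group_size` weights (illustrative sketch)."""
    levels = 2**bits - 1
    w = weights.reshape(-1, group_size)                  # one row per group
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / levels, 1e-8)   # guard against flat groups
    zero = np.round(-w_min / scale)                      # shared zero-point per group
    q = np.clip(np.round(w / scale) + zero, 0, levels).astype(np.uint8)
    return q, scale, zero

def dequantize_groups(q, scale, zero):
    """Reconstruct approximate float weights from quantized values."""
    return ((q.astype(np.float32) - zero) * scale).reshape(-1)

# Example: 128 weights -> 2 groups of 64, each with its own scale/zero-point.
w = np.random.randn(128).astype(np.float32)
q, scale, zero = quantize_groups(w)
w_hat = dequantize_groups(q, scale, zero)
print("max abs error:", np.abs(w - w_hat).max())  # on the order of a group's scale
```

Smaller groups tighten the per-group range and hence the scale, which is why group sizes like 64 let even 2- and 3-bit weights stay accurate at the cost of storing one scale and zero-point per group.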
Jul-7-2023