QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models

Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel

arXiv.org Artificial Intelligence 

Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solutions.

Focusing specifically on generative inference, where the size of the weights is the main bottleneck, the currently best-performing method is GPTQ (Frantar et al., 2022), which achieves near-lossless quantization to 4-bit weights, and can even accurately support 2- and 3-bit weights by reducing the granularity to smaller weight groups, e.g., by jointly quantizing blocks of 64 weights using a shared scale and zero-point.
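To make the group-wise scheme concrete, the following is a minimal sketch of quantizing weights in blocks of 64 with a shared scale and zero-point per block, then dequantizing them. The function names and the asymmetric min/max rounding scheme are illustrative assumptions, not the exact GPTQ procedure or the paper's generated kernels.

```python
# Sketch of group-wise quantization: each block of `group_size` weights
# shares one scale and one zero-point (assumed min/max scheme, not GPTQ).
import numpy as np

def quantize_groupwise(w: np.ndarray, bits: int = 4, group_size: int = 64):
    """Quantize a 1-D float weight array in blocks of `group_size`."""
    assert w.size % group_size == 0
    levels = 2**bits - 1
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / levels           # one scale per group
    scale[scale == 0] = 1.0              # guard against constant groups
    zero = np.round(-lo / scale)         # one zero-point per group
    q = np.clip(np.round(groups / scale) + zero, 0, levels).astype(np.uint8)
    return q, scale, zero

def dequantize_groupwise(q, scale, zero):
    """Reconstruct approximate weights from codes, scales, zero-points."""
    return ((q.astype(np.float32) - zero) * scale).ravel()

w = np.random.randn(4096).astype(np.float32)
q, s, z = quantize_groupwise(w, bits=4, group_size=64)
w_hat = dequantize_groupwise(q, s, z)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Smaller groups give each block a tighter dynamic range, which is what lets 2- and 3-bit variants stay accurate, at the cost of storing more scales and zero-points.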
