QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models
Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel
arXiv.org Artificial Intelligence
Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution.

Focusing specifically on generative inference, where the size of the weights is the main bottleneck, the currently best-performing method is GPTQ (Frantar et al., 2022), which achieves near-lossless quantization to 4-bit weights, and can even accurately support 2- and 3-bit weights by reducing the granularity to smaller weight groups, e.g., by jointly quantizing blocks of 64 weights using a shared scale and zero-point.
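To make the grouped scale-and-zero-point scheme concrete, below is a minimal NumPy sketch of plain round-to-nearest quantization over blocks of 64 weights. This is an illustration only: GPTQ's actual procedure corrects quantization error weight-by-weight rather than rounding independently, and QIGen's generated kernels are C++ code, not Python. The function names `quantize_groups` and `dequantize_groups` are hypothetical.

```python
import numpy as np

def quantize_groups(weights, bits=4, group_size=64):
    """Round-to-nearest quantization with a shared scale and
    zero-point per group of `group_size` weights (illustrative sketch)."""
    levels = 2**bits - 1
    w = weights.reshape(-1, group_size)                  # one row per group
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / levels, 1e-8)   # guard against flat groups
    zero = np.round(-w_min / scale)                      # shared zero-point per group
    q = np.clip(np.round(w / scale) + zero, 0, levels).astype(np.uint8)
    return q, scale, zero

def dequantize_groups(q, scale, zero):
    """Reconstruct approximate float weights from quantized values."""
    return ((q.astype(np.float32) - zero) * scale).reshape(-1)

# Example: 128 weights -> 2 groups of 64, each with its own scale/zero-point.
w = np.random.randn(128).astype(np.float32)
q, scale, zero = quantize_groups(w)
w_hat = dequantize_groups(q, scale, zero)
print("max abs error:", np.abs(w - w_hat).max())  # on the order of a group's scale
```

Smaller groups tighten the per-group range and hence the scale, which is why group sizes like 64 let even 2- and 3-bit weights stay accurate at the cost of storing one scale and zero-point per group.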
Jul-7-2023