MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness

Zihao Zheng, Xiuping Cui, Size Zheng, Maoliang Li, Jiayu Chen, Yun Liang, Xiang Chen

arXiv.org Artificial Intelligence 

However, the parameter density of LLMs has struggled to keep pace with the diverse and ever-growing volumes of data to be processed. To address this limitation, the Mixture-of-Experts (MoE) architecture has emerged as one of the most promising LLM implementation approaches [1]. An MoE model contains multiple "expert" networks, each consisting of an individual model or specialized layers, and each expert is trained to fit a different aspect of the data. When deployed in a particular inference scenario, the MoE dynamically selects a subset of these experts to be sparsely activated, allowing the model to match the corresponding data distribution [2-4].

Although MoE models improve parameter scalability and memory efficiency through sparse activation, they still face the need for parameter compression [5, 6]. As a large body of LLM compression studies has shown, quantization is the most efficient compression method: it reduces model volume by refactoring parameters into low-precision numbers [7]. Meanwhile, as quantization techniques have developed, the methodological focus has gradually shifted from the parameters themselves to the mapping between parameters and complex data inputs. Some methods, such as GPTQ [8], leverage data distribution analysis to establish a data-parameter mapping that guides iterative channel-wise parameter quantization; later methods further examine relative data scales and their impact on data-parameter correlation, highlighting the significant variation among parameters (e.g., SmoothQuant [9], AWQ [10]), thereby achieving mixed-precision quantization with better performance (e.g., Atom [11]).
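The sparse activation described above can be illustrated with a minimal top-k gating sketch. All names here (`moe_forward`, `gate_w`, `experts`) are illustrative assumptions, not from any particular MoE implementation; real systems route per token and add load-balancing losses, which are omitted:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Sparse MoE forward pass: route input x to its top-k experts.

    x: (d,) input vector; gate_w: (d, n_experts) router weights;
    experts: list of callables mapping (d,) -> (d,).
    """
    logits = x @ gate_w                       # one router score per expert
    topk = np.argsort(logits)[-k:]            # indices of the k best experts
    # softmax over the selected experts only (standard top-k gating)
    w = np.exp(logits[topk] - logits[topk].max())
    w /= w.sum()
    # only the selected experts run -- this is the sparse activation
    return sum(wi * experts[i](x) for wi, i in zip(w, topk))
```

Because only `k` of the `n_experts` networks execute per input, compute per token stays roughly constant while total parameter count grows with the number of experts.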
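As a sketch of what "refactoring parameters into low-precision numbers" means concretely, the following shows plain per-channel round-to-nearest symmetric quantization. This is a baseline illustration under assumed conventions (one scale per output channel, signed integer codes), not the method of GPTQ, SmoothQuant, or AWQ, which additionally exploit the data-parameter mapping discussed above:

```python
import numpy as np

def quantize_per_channel(W, n_bits=4):
    """Symmetric round-to-nearest quantization with one scale per row.

    W: (out_channels, in_channels) weight matrix.
    Returns integer codes and the per-channel scales for dequantization.
    """
    qmax = 2 ** (n_bits - 1) - 1                       # e.g. 7 for 4-bit
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                            # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Each weight is stored as a small integer plus a shared per-channel scale, so the reconstruction error is bounded by half a quantization step per channel; data-aware methods improve on this by choosing scales (or precisions) based on activation statistics rather than weight magnitudes alone.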