pruning pattern
From 2:4 to 8:16 sparsity patterns in LLMs for Outliers and Weights with Variance Correction
Maximov, Egor, Kuzkina, Yulia, Kanametov, Azamat, Prutko, Alexander, Goncharov, Aleksei, Zhelnin, Maxim, Shvetsov, Egor
As large language models (LLMs) grow in size, efficient compression techniques such as quantization and sparsification become critical. While quantization maintains performance at reduced precision, structured sparsity methods such as N:M sparsification often fall short due to limited flexibility and sensitivity to outlier weights. We explore 8:16 semi-structured sparsity and demonstrate its ability to surpass the Performance Threshold, the point at which a compressed model matches the accuracy of its uncompressed or smaller counterpart under equivalent memory constraints. Compared to 2:4 sparsity, 8:16 offers greater flexibility with minimal additional storage overhead (0.875 vs. 0.75 bits/element). We also apply structured sparse patterns to salient weights, showing that structured sparsity for outliers is competitive with unstructured approaches and yields equivalent or better results. Finally, we demonstrate that simple techniques such as variance correction and SmoothQuant-like weight equalization improve the performance of sparse models.
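A minimal NumPy sketch of magnitude-based N:M pruning follows. It assumes plain per-group magnitude selection, which is not necessarily the paper's outlier-aware scheme, and `nm_mask` and `metadata_bits_per_element` are hypothetical helper names. The storage figures quoted above are consistent with encoding each group's mask as an index into the C(M, N) possible patterns: ceil(log2 C(4,2)) / 4 = 3/4 = 0.75 and ceil(log2 C(16,8)) / 16 = 14/16 = 0.875 bits/element.

```python
import math

import numpy as np


def nm_mask(weights: np.ndarray, n: int, m: int) -> np.ndarray:
    """Boolean mask keeping the n largest-magnitude weights in every group
    of m consecutive weights along the last axis (magnitude-based N:M)."""
    if weights.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    groups = np.abs(weights).reshape(-1, m)
    # argsort is ascending, so the last n indices are the n largest |w|
    keep = np.argsort(groups, axis=1)[:, m - n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(weights.shape)


def metadata_bits_per_element(n: int, m: int) -> float:
    """Bits to index one of C(m, n) patterns per group, amortized over m elements."""
    return math.ceil(math.log2(math.comb(m, n))) / m


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 32))
    for n, m in [(2, 4), (8, 16)]:
        mask = nm_mask(w, n, m)
        print(f"{n}:{m} density={mask.mean()} metadata bits/element={metadata_bits_per_element(n, m)}")
```

Both patterns keep half the weights; the difference is that an 8:16 group can place its eight survivors anywhere among 16 positions, which is where the extra flexibility (and the extra 0.125 bits/element) comes from.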
EvoP: Robust LLM Inference via Evolutionary Pruning
Wu, Shangyu, Du, Hongchao, Xiong, Ying, Chen, Shuai, Kuo, Tei-wei, Guan, Nan, Xue, Chun Jason
Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, but their massive size and computational demands hinder their deployment in resource-constrained environments. Existing structured pruning methods address this issue by removing redundant structures (e.g., elements, channels, layers) from the model. However, these methods rely on heuristic pruning strategies, which lead to suboptimal performance, and they ignore data characteristics when pruning the model. To overcome these limitations, we propose EvoP, an evolutionary pruning framework for robust LLM inference. EvoP first introduces a cluster-based calibration dataset sampling (CCDS) strategy to create a more diverse calibration dataset, and then an evolutionary pruning pattern searching (EPPS) method to find the optimal pruning pattern. Compared to existing structured pruning techniques, EvoP achieves the best performance while maintaining the best efficiency. Experiments across different LLMs and downstream tasks validate the effectiveness of EvoP, making it a practical and scalable solution for deploying LLMs in real-world applications.
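The abstract does not detail EPPS, so the following is only a generic sketch of an evolutionary search over layer-pruning patterns, not EvoP's algorithm (CCDS is not shown). It assumes a pattern is one keep/prune bit per layer with a fixed pruning budget; the `fitness` callable is a hypothetical stand-in for scoring a pruned model on a calibration set (for example, negative loss), and all function names are illustrative.

```python
import random
from typing import Callable, List

Pattern = List[int]  # one entry per layer: 1 = keep, 0 = prune


def random_pattern(num_layers: int, num_pruned: int, rng: random.Random) -> Pattern:
    pruned = set(rng.sample(range(num_layers), num_pruned))
    return [0 if i in pruned else 1 for i in range(num_layers)]


def mutate(pattern: Pattern, rng: random.Random) -> Pattern:
    """Swap one kept and one pruned layer, so the pruning budget is unchanged."""
    child = pattern[:]
    kept = [i for i, bit in enumerate(child) if bit == 1]
    pruned = [i for i, bit in enumerate(child) if bit == 0]
    if kept and pruned:
        i, j = rng.choice(kept), rng.choice(pruned)
        child[i], child[j] = 0, 1
    return child


def evolve(
    num_layers: int,
    num_pruned: int,
    fitness: Callable[[Pattern], float],  # hypothetical scorer: higher is better
    pop_size: int = 16,
    generations: int = 30,
    seed: int = 0,
) -> Pattern:
    rng = random.Random(seed)
    population = [random_pattern(num_layers, num_pruned, rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Elitist selection: keep the better half, refill with mutated copies.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [mutate(rng.choice(parents), rng) for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    # Toy fitness: pretend each layer has a known importance score.
    importance = [0.9, 0.1, 0.8, 0.2, 0.7, 0.05, 0.6, 0.3, 0.95, 0.4, 0.5, 0.15]

    def score(p: Pattern) -> float:
        return sum(w for w, keep in zip(importance, p) if keep)

    best = evolve(len(importance), num_pruned=3, fitness=score)
    print("pruned layers:", [i for i, bit in enumerate(best) if bit == 0])
```

Because the better half always survives, the best score never decreases across generations; in a real setting each fitness call would run the pruned LLM on calibration data, so the population size and number of generations dominate the search cost.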
Multiobjective Evolutionary Pruning of Deep Neural Networks with Transfer Learning for improving their Performance and Robustness
Poyatos, Javier, Molina, Daniel, Martínez, Aitor, Del Ser, Javier, Herrera, Francisco
Evolutionary Computation algorithms have been used to solve optimization problems related to architecture, hyper-parameter, or training configuration, forging the field known today as Neural Architecture Search. These algorithms have been combined with other techniques such as Neural Network pruning, which reduces the complexity of the network, and Transfer Learning, which allows knowledge to be imported from a problem related to the one at hand. Using several criteria to evaluate the quality of evolutionary proposals is also common, with the performance and complexity of the network being the most widely used. This work proposes MO-EvoPruneDeepTL, a multi-objective evolutionary pruning algorithm. MO-EvoPruneDeepTL uses Transfer Learning to adapt the last layers of Deep Neural Networks, replacing them with sparse layers evolved by a genetic algorithm that guides the evolution based on the performance, complexity, and robustness of the network, with robustness serving as a strong quality indicator for the evolved models. We carry out experiments with several datasets to assess the benefits of our proposal. Results show that the proposal achieves promising results across all objectives and reveal direct relations among them. The experiments also show that the most influential neurons help explain which parts of the input images are most relevant for the prediction of the pruned neural network. Lastly, owing to the diversity within the Pareto front of pruning patterns produced by the proposal, we show that an ensemble of differently pruned models improves the overall performance and robustness of the trained networks.
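Choosing among evolved pruning patterns under several objectives relies on Pareto dominance, and the abstract's final result combines differently pruned models into an ensemble. A minimal sketch of both ideas follows; the objective tuple (accuracy, active neurons, robustness), the majority-vote ensemble, and the helper names are illustrative assumptions, not MO-EvoPruneDeepTL's exact formulation.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# Hypothetical objective tuple per evolved model:
# (accuracy, active_neurons, robustness); maximize accuracy and robustness,
# minimize active_neurons (a proxy for complexity).
Objectives = Tuple[float, float, float]


def _as_maximize(o: Objectives) -> Tuple[float, float, float]:
    acc, complexity, robustness = o
    return (acc, -complexity, robustness)


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    ka, kb = _as_maximize(a), _as_maximize(b)
    return all(x >= y for x, y in zip(ka, kb)) and any(x > y for x, y in zip(ka, kb))


def pareto_front(candidates: Sequence[Objectives]) -> List[Objectives]:
    """Candidates not dominated by any other candidate."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Ensemble: per-sample majority label across models (ties go to the first seen)."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]


if __name__ == "__main__":
    models = [(0.90, 500, 0.70), (0.88, 300, 0.72), (0.85, 600, 0.60), (0.92, 800, 0.65)]
    print(pareto_front(models))  # the third model is dominated and dropped
    print(majority_vote([[0, 1, 1], [0, 1, 0], [1, 1, 0]]))  # -> [0, 1, 0]
```

The surviving front trades accuracy against size and robustness, which is why no single member is best on every objective and why an ensemble of front members can be attractive.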
EVE: Environmental Adaptive Neural Network Models for Low-power Energy Harvesting System
Islam, Sahidul, Zhou, Shanglin, Ran, Ran, Jin, Yufang, Wen, Wujie, Ding, Caiwen, Xie, Mimi
IoT devices are increasingly being implemented with neural network models to enable smart applications. Energy harvesting (EH) technology, which harvests energy from the ambient environment, is a promising alternative to batteries for powering those devices due to its low maintenance cost and the wide availability of energy sources. However, the power provided by the energy harvester is low and intrinsically unstable, since it varies with the ambient environment. This paper proposes EVE, an automated machine learning (autoML) co-exploration framework to search for desired multi-models with shared weights for energy harvesting …

However, when DNN models are deployed on-board, there is a grand challenge in accommodating giant models on tiny IoT devices with limited memory and computing resources [3, 11-13, 20, 22]. In particular, first, embedded IoT devices have limited computational units and low CPU frequency (e.g., 1-16 MHz). Since DNNs are computationally expensive, DNN algorithms take a long on-board execution time. Second, embedded IoT devices are equipped with small memory (e.g., hundreds of KBs), which cannot accommodate even tiny DNN models (e.g., tens of MBs). Third, these battery-powered devices naturally have a limited standby time.
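The abstract is cut off before describing EVE's method, so the following is only an illustrative sketch of why shared weights help under the memory limits described above, and of how a device might pick a model for its current energy budget. The nested-subset sharing, the `SubModel` fields, the energy and accuracy numbers, and the selection rule are all hypothetical assumptions, not EVE's design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubModel:
    name: str
    params: int        # weights used; assumed: smaller models reuse a subset of larger ones
    energy_mj: float   # hypothetical energy per inference, in millijoules
    accuracy: float    # hypothetical validation accuracy


def shared_storage_bytes(models: List[SubModel], bytes_per_weight: int = 1) -> int:
    # With nested weight sharing, storing the largest model covers all of them.
    return max(m.params for m in models) * bytes_per_weight


def separate_storage_bytes(models: List[SubModel], bytes_per_weight: int = 1) -> int:
    # Independent models each need their own copy of the weights.
    return sum(m.params for m in models) * bytes_per_weight


def select_model(models: List[SubModel], harvested_mj: float) -> Optional[SubModel]:
    """Most accurate sub-model whose per-inference energy fits the budget, if any."""
    feasible = [m for m in models if m.energy_mj <= harvested_mj]
    return max(feasible, key=lambda m: m.accuracy) if feasible else None


if __name__ == "__main__":
    family = [
        SubModel("small", 50_000, 0.5, 0.80),
        SubModel("medium", 120_000, 1.2, 0.86),
        SubModel("large", 200_000, 2.0, 0.90),
    ]
    # 8-bit weights (1 byte each): 200000 shared vs 370000 separate
    print(shared_storage_bytes(family), separate_storage_bytes(family))
    chosen = select_model(family, harvested_mj=1.5)
    print(chosen.name if chosen else "skip inference")  # medium
```

With these made-up numbers, weight sharing keeps three models in 200 KB instead of 370 KB, which matters on devices with only hundreds of KBs of memory.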