LLM-Pruner: On the Structural Pruning of Large Language Models
Xinyin Ma, Gongfan Fang, Xinchao Wang (National University of Singapore)
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges at the deployment, inference, and training stages. With the LLM being a general-purpose task solver, we explore its compression in a task-agnostic manner, which aims to preserve the multi-task solving and language generation ability of the original LLM. One challenge to achieving this is the enormous size of the LLM's training corpus, which makes both data transfer and model post-training over-burdensome. Thus, we tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures based on gradient information, maximally preserving the majority of the LLM's functionality. The performance of pruned models can then be efficiently recovered through tuning techniques such as LoRA in merely 3 hours, requiring only 50K samples of data. We validate LLM-Pruner on three LLMs, including LLaMA, Vicuna, and ChatGLM, and demonstrate that the compressed models still exhibit satisfactory capabilities in zero-shot classification and generation.
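To make the gradient-based criterion above concrete, here is a minimal sketch assuming a hypothetical group partitioning and calibration batch (the function and variable names are illustrative, not the paper's actual API): coupled parameter groups are scored with a first-order Taylor term |w * dL/dw| and the lowest-scoring groups are marked for removal.

```python
import torch

def group_importance(model, groups, loss_fn, calibration_batch):
    """Score each coupled parameter group with a first-order Taylor criterion.

    `groups` maps a group name to the list of parameter tensors that must be
    pruned together (e.g., the slices of one attention head). All names here
    are illustrative placeholders, not the paper's API.
    """
    inputs, targets = calibration_batch
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()  # populate .grad on all parameters

    scores = {}
    for name, params in groups.items():
        # |w * dL/dw| aggregated over every tensor in the coupled group:
        # a first-order estimate of the loss change if the group is removed.
        scores[name] = sum((p.detach() * p.grad).abs().sum().item() for p in params)
    return scores

def select_groups_to_prune(scores, ratio=0.2):
    """Return the names of the lowest-importance groups (candidates for removal)."""
    k = int(len(scores) * ratio)
    return sorted(scores, key=scores.get)[:k]
```

In the paper's pipeline, the pruned model is subsequently fine-tuned with LoRA on a small corpus to recover performance; that recovery step is independent of the scoring sketched here.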
From my understanding, the paper makes a good technical contribution, unifying a large body of work on isotonic regression (IR). The basic idea seems intuitive: employ techniques from fast solvers of linear systems. Thus, from the perspective of novelty and technical content, I cannot raise any issues (based on my limited understanding; regrettably, I do not have the background to check the proofs). My concern with the paper is simply that it may be better suited to an algorithms/theoretical CS conference or journal, such as those where the work it improves upon ([16]-[20]) and the work it employs in developing the algorithm ([21]-[29]) were published. It is unclear to me whether the results in the paper would be of sufficient interest to the broader NIPS community. In particular: while IR has seen some interesting applications to learning problems of late, it is not (in my estimation) a core ML tool for which a faster algorithm is by itself of wide interest.
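For context on the primitive under discussion (this is generic library usage, not the faster algorithm proposed in the reviewed paper), a typical ML use of isotonic regression, e.g., probability calibration, looks like this minimal scikit-learn sketch:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Fit a non-decreasing step function to noisy scores; the standard routine,
# not the solver proposed in the reviewed paper.
x = np.arange(10, dtype=float)
y = np.array([0.10, 0.30, 0.20, 0.50, 0.40, 0.60, 0.80, 0.70, 0.90, 1.00])

iso = IsotonicRegression(increasing=True)
y_fit = iso.fit_transform(x, y)  # L2 projection of y onto the monotone cone
print(y_fit)
```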
A Relations to Algorithmic Stability of SGD
In this section, we formally introduce notions of algorithmic stability and relate them to the results presented in the paper. Let S denote a set of datapoints and Z a distribution over examples. It is well known (e.g., [9, 32]) that for any distribution Z and algorithm A, the generalization gap is bounded above by the AAS, which in turn is bounded above by the UAS. However, considering Eq. (6), it remained unclear whether the AAS and the generalization gap of these algorithms exhibit rates of similar order, in which case the UAS accurately captures the rate of the generalization gap. Interestingly, the answer to this question depends on whether sampling is done with or without replacement, as we discuss next. This follows from Eq. (6) together with the bound on the generalization gap established by the theorem. In Section 4, specifically in Corollary 2, we establish a generalization gap of O(1/√n) for with-replacement SGD with a particular averaging scheme and a properly tuned step size.
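A hedged reconstruction of the garbled inequality chain, with symbols assumed rather than taken from the paper (A(S) is the algorithm's output on sample S, F the population risk, F̂_S the empirical risk, and δ_AAS, δ_UAS the two stability parameters for an L-Lipschitz loss):

```latex
% Assumed notation, not reproduced from the paper: the expected generalization
% gap is controlled by the average argument stability (AAS), which is in turn
% bounded by the uniform argument stability (UAS), for an L-Lipschitz loss.
\mathbb{E}_{S \sim Z^n}\!\Big[ F\big(A(S)\big) - \widehat{F}_S\big(A(S)\big) \Big]
\;\le\; L\,\delta_{\mathrm{AAS}}(A)
\;\le\; L\,\delta_{\mathrm{UAS}}(A).
```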
The Generalization-Stability Tradeoff in Neural Network Pruning
Pruning neural network parameters is often viewed as a means to compress models, but pruning has also been motivated by the desire to prevent overfitting. This motivation is particularly relevant given the perhaps surprising observation that a wide variety of pruning approaches increase test accuracy despite sometimes massive reductions in parameter counts. To better understand this phenomenon, we analyze the behavior of pruning over the course of training, finding that pruning's benefit to generalization increases with pruning's instability (defined as the drop in test accuracy immediately following pruning). We demonstrate that this "generalization-stability tradeoff" is present across a wide variety of pruning settings and propose a mechanism for its cause: pruning regularizes similarly to noise injection. Supporting this, we find less pruning stability leads to more model flatness and the benefits of pruning do not depend on permanent parameter removal. These results explain the compatibility of pruning-based generalization improvements and the high generalization recently observed in overparameterized networks.
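To make the instability measure concrete, here is a minimal sketch assuming hypothetical `evaluate` and `apply_pruning` helpers (not the paper's code): instability is logged as the drop in test accuracy immediately following a pruning event.

```python
def pruning_instability(model, test_loader, evaluate, apply_pruning, prune_fraction=0.1):
    """Instability of one pruning event: test accuracy just before pruning
    minus test accuracy just after.

    `evaluate` and `apply_pruning` are placeholder callables; the paper relates
    larger instability values to larger generalization benefits.
    """
    acc_before = evaluate(model, test_loader)
    apply_pruning(model, prune_fraction)   # e.g., zero out a fraction of weights per layer
    acc_after = evaluate(model, test_loader)
    return acc_before - acc_after
```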
Author Feedback
We thank the reviewers for their detailed, valuable reviews.
These studies use a "more robust pruning regime" (R2) by pruning a constant fraction of all layers in all blocks; the ResNet20 results seem to apply to less "modern" regimes (R1), though modern regimes inspired our central question.
R2, R3 (generalization gap vs. test accuracy; train accuracy not reported): Correct; we neglected to state that all ... We will update our manuscript to clearly discuss training accuracies and plot the generalization gaps.
R3 ("Pearson correlation and slope do not give an accurate characterization"): Correct, the graphed relationships ...
R2, R3 (methodology in the "main body" and its "clarity"): We will move methodological details to the "main body."
R1, R3 (hyperparameter choices: "These networks reach much lower accuracy than expected...", L1/L2 regularization): ... It led to our exploring pruning of the last convolutional layers of VGG11/ResNet18.
R2, R3 ("[DSD] is worth a comparison" and "the claim... is hard to extract"): We thank the reviewer for pointing to DSD; we show that the parameters can re-enter at zero or at their original values (Figure D.2) while achieving the full ...
R3 ("[15] is not found to improve generalization"): [15] (LeCun et al., 1990) says OBD improved test error in the last ...
This table is the same as Table 1 in the paper; we place it here for reading convenience. As shown in Table 1, FADI achieves new state-of-the-art performance in extremely few-shot scenarios, i.e., K=1, 2, 3 on novel splits 1 and 3. However, the performance of FADI is slightly less than satisfactory at higher shots and on novel split 2. Here we analyze the possible reasons and summarize them as two limitations.
Jiaqi Wang, Tong Wu
Object detection has achieved substantial progress in the last decade. However, detecting novel classes with only a few samples remains challenging, since deep learning under a low-data regime usually leads to a degraded feature space. Existing works employ a holistic fine-tuning paradigm to tackle this problem, where the model is first pre-trained on all base classes with abundant samples and then used to carve the novel class feature space. Nonetheless, this paradigm is still imperfect. During fine-tuning, a novel class may implicitly leverage the knowledge of multiple base classes to construct its feature space, which induces a scattered feature space and hence violates inter-class separability. To overcome these obstacles, we propose a two-step fine-tuning framework, Few-shot object detection via Association and DIscrimination (FADI), which builds up a discriminative feature space for each novel class in two integral steps.
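As a purely conceptual sketch of the two-step recipe just described (the callables and the pairing criterion below are placeholders for illustration, not FADI's actual implementation), the fine-tuning stage could be organized as follows:

```python
def two_step_finetune(model, base_classes, novel_classes, novel_data,
                      associate, train_association, train_discrimination):
    """Two-step few-shot fine-tuning sketch: association, then discrimination.

    All three callables are hypothetical placeholders; how a novel class is
    paired with a base class, and how the pairs are later disentangled, are
    defined in the FADI paper, not here.
    """
    # Step 1 (association): pair each novel class with a single, well-separated
    # base class so its features occupy a compact region instead of scattering
    # across many base-class regions.
    pairing = {c: associate(c, base_classes) for c in novel_classes}
    model = train_association(model, novel_data, pairing)

    # Step 2 (discrimination): fine-tune so that each associated novel/base
    # pair remains separable, restoring inter-class separability.
    model = train_discrimination(model, novel_data, pairing)
    return model
```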
A Supplementary Materials
A.1 Comparison with Existing Meta-Learning-based Adversarial Attack Techniques
Meta-Self [125] is a poisoning attack model for node classification that leverages meta-learning to generate attacks, i.e., it uses meta-gradients to solve the bilevel optimization problem. It conducts adversarial attacks on global node classification of a single graph. Specifically, it solves a bilevel optimization problem consisting of (1) training a classifier on the graph and (2) attacking the graph, and it gradually improves attack performance by using meta-learning to iteratively alternate between these two problems. The GMA model utilizes meta-learning to find good attack starting points in two graphs.
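The meta-gradient idea above can be sketched roughly as follows; the surrogate forward function, its parameter initializer, and the greedy edge choice at the end are simplified placeholders, not the actual Meta-Self or GMA implementations (which additionally enforce symmetry, degree, and discreteness constraints on the perturbed graph).

```python
import torch
import torch.nn.functional as F

def meta_gradient_attack_step(adj, features, labels, forward, init_params,
                              lr=0.01, inner_steps=10):
    """One meta-gradient step of a graph poisoning attack (rough sketch).

    forward(params, adj, features) -> logits and init_params() -> list of
    tensors (requires_grad=True) are hypothetical placeholders for a
    differentiable surrogate GNN.
    """
    adj = adj.detach().clone().requires_grad_(True)
    params = init_params()

    # Inner problem: differentiably train the surrogate on the perturbed graph,
    # keeping the computation graph of the updates (create_graph=True) so we
    # can later differentiate through the whole training trajectory.
    for _ in range(inner_steps):
        loss = F.cross_entropy(forward(params, adj, features), labels)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        params = [p - lr * g for p, g in zip(params, grads)]

    # Outer problem: the attacker's objective (here simply the classification
    # loss after training, which the attacker wants to increase), differentiated
    # with respect to the graph structure *through* the inner training.
    attack_loss = F.cross_entropy(forward(params, adj, features), labels)
    meta_grad = torch.autograd.grad(attack_loss, adj)[0]

    # Greedy choice: the adjacency entry whose perturbation is estimated to
    # increase the attack loss the most.
    idx = torch.argmax(meta_grad.abs())
    return idx, meta_grad.flatten()[idx]
```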