
Search for Efficient Large Language Models

Neural Information Processing Systems

Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Numerous efficient techniques, including weight pruning, quantization, and distillation, have been embraced to compress LLMs, targeting memory reduction and inference acceleration, which underscore the redundancy in LLMs. However, most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures. Besides, traditional architecture search methods, limited by the elevated complexity with extensive parameters, struggle to demonstrate their effectiveness on LLMs. In this paper, we propose a training-free architecture search framework to identify optimal subnets that preserve the fundamental strengths of the original LLMs while achieving inference acceleration. Furthermore, after generating subnets that inherit specific weights from the original LLMs, we introduce a reformation algorithm that utilizes the omitted weights to rectify the inherited weights with a small amount of calibration data. Compared with SOTA training-free structured pruning works that can generate smaller networks, our method demonstrates superior performance across standard benchmarks. Furthermore, our generated subnets can directly reduce the usage of GPU memory and achieve inference acceleration.
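The abstract does not spell out the reformation algorithm, but one simple instantiation of the idea — using calibration data so a pruned layer's kept weights best reproduce the original layer's outputs — is a least-squares fit. The shapes, the `keep` index set, and the least-squares choice below are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer shapes: 64 input channels, of which the subnet keeps 32.
d_in, d_out, kept = 64, 16, 32
W_full = rng.normal(size=(d_in, d_out))            # original layer weights
keep = rng.choice(d_in, size=kept, replace=False)  # channels the subnet inherits

# Small calibration batch of inputs to this layer.
X = rng.normal(size=(128, d_in))

# Rectify the inherited weights so the pruned layer best matches the
# original layer's outputs on the calibration data (least squares),
# letting the omitted channels' contribution be absorbed into W_reformed.
target = X @ W_full
W_reformed, *_ = np.linalg.lstsq(X[:, keep], target, rcond=None)
print(W_reformed.shape)  # (32, 16)
```

By construction the least-squares solution matches the original outputs at least as well on the calibration batch as naively keeping `W_full[keep]` would.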






Neural Additive Models: Interpretable Machine Learning with Neural Nets

Neural Information Processing Systems

They perform similarly to existing state-of-the-art generalized additive models in accuracy, but are more flexible because they are based on neural nets instead of boosted trees.
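The defining structure of a Neural Additive Model is that the prediction is a sum of per-feature shape functions, each its own small neural net. A minimal forward-pass sketch of that structure (the hidden width and random weights here are placeholder assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_net(x, w1, b1, w2, b2):
    # One small MLP per feature: scalar input -> ReLU hidden layer -> scalar output.
    h = np.maximum(0.0, x[:, None] * w1 + b1)  # (n, hidden)
    return h @ w2 + b2                          # (n,)

def nam_forward(X, nets, bias):
    # NAM prediction: a global bias plus the sum of each feature's shape function.
    return bias + sum(feature_net(X[:, j], *nets[j]) for j in range(X.shape[1]))

# Toy setup: 3 features, hidden width 8, untrained random weights.
n_features, hidden = 3, 8
nets = [(rng.normal(size=hidden), rng.normal(size=hidden),
         rng.normal(size=hidden), 0.0) for _ in range(n_features)]
X = rng.normal(size=(5, n_features))
y = nam_forward(X, nets, bias=0.0)
print(y.shape)  # (5,)
```

Because each feature's contribution is a one-dimensional function, it can be plotted directly, which is where the interpretability of the model comes from.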


Rethinking Imbalance in Image Super-Resolution for Efficient Inference

Neural Information Processing Systems

Image super-resolution (SR) aims to reconstruct high-resolution (HR) images with more details from low-resolution (LR) images. Recently, deep learning-based image SR methods have made significant progress in reconstruction performance through deeper network models and large-scale training datasets, but these improvements place higher demands on both computing power and memory resources, thus requiring more efficient solutions.


error is simply the

Neural Information Processing Systems

Figure (b) above shows that the performance is robust to different GCN embedding sizes. "EA... degree to help": Figure (a) shows an ablation study on NAS-Bench-201, which varies each component (surrogate); the other experimental settings are the same as in Section 4.2. As can be seen, more accurate architectures are close to each other. "BO typically works better in low-dimensional...": Here, in Figure (d) above, we use subnets that are sampled in the same search iteration. "For example, it is common to see pooling": Yes. Thus, the GCN propagation part is more important than how to add the global node.


