Staffler, Benedikt
Large Language Model Compression with Neural Architecture Search
Sukthanker, Rhea Sanjay, Staffler, Benedikt, Hutter, Frank, Klein, Aaron
Large language models (LLMs) exhibit remarkable reasoning abilities, allowing them to generalize across a wide range of downstream tasks, such as commonsense reasoning or instruction following. However, as LLMs scale, inference costs become increasingly prohibitive and accumulate significantly over a model's life cycle. This poses the question: Can we compress pre-trained LLMs to meet diverse size and latency requirements? We leverage Neural Architecture Search (NAS) to compress LLMs by pruning structural components, such as attention heads, neurons, and layers, aiming to achieve a Pareto-optimal balance between performance and efficiency. While NAS has already achieved promising results on small language models in previous work, in this paper we propose several extensions that allow it to scale to LLMs. Compared to structural pruning baselines, we show that NAS improves performance by up to 3.4% on MMLU while also providing an on-device latency speedup.
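To make the pruning search concrete, the following is a minimal sketch of multi-objective search over structural pruning configurations. It is not the method from the paper (which builds on weight-sharing NAS); the search-space bounds, the mock evaluate() proxy, and the simple Pareto filter are all invented for illustration.

```python
# Hypothetical sketch: search over pruning configurations (layers kept,
# attention heads kept, fraction of FFN neurons kept) and retain the Pareto
# front of (accuracy, latency). evaluate() is a mock stand-in for actually
# measuring a pruned LLM; its numbers are synthetic.
import random

SEARCH_SPACE = {
    "num_layers": range(16, 33),         # transformer layers kept
    "num_heads": range(8, 33),           # attention heads kept per layer
    "ffn_ratio": [0.25, 0.5, 0.75, 1.0]  # fraction of FFN neurons kept
}

def sample_config(rng):
    return {k: rng.choice(list(v)) for k, v in SEARCH_SPACE.items()}

def evaluate(cfg, rng):
    """Mock proxy: larger sub-networks score higher but are slower."""
    size = cfg["num_layers"] * cfg["num_heads"] * cfg["ffn_ratio"]
    accuracy = 0.5 + 0.4 * size / (32 * 32 * 1.0) + rng.gauss(0, 0.01)
    latency = 5.0 + 0.02 * size + rng.gauss(0, 0.1)
    return accuracy, latency

def pareto_front(points):
    """Keep configurations not dominated in (higher accuracy, lower latency)."""
    front = []
    for cfg, acc, lat in points:
        dominated = any(a >= acc and l <= lat and (a > acc or l < lat)
                        for _, a, l in points)
        if not dominated:
            front.append((cfg, acc, lat))
    return front

rng = random.Random(0)
candidates = []
for _ in range(200):
    cfg = sample_config(rng)
    acc, lat = evaluate(cfg, rng)
    candidates.append((cfg, acc, lat))

for cfg, acc, lat in pareto_front(candidates):
    print(f"acc={acc:.3f}  latency={lat:.2f}ms  {cfg}")
```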
HW-GPT-Bench: Hardware-Aware Architecture Benchmark for Language Models
Sukthanker, Rhea Sanjay, Zela, Arber, Staffler, Benedikt, Klein, Aaron, Purucker, Lennart, Franke, Joerg K. H., Hutter, Frank
The increasing size of language models necessitates a thorough analysis across multiple dimensions to assess trade-offs among crucial hardware metrics such as latency, energy consumption, GPU memory usage, and performance. Identifying optimal model configurations under specific hardware constraints is becoming essential but remains challenging due to the computational cost of exhaustively training and evaluating models on multiple devices. To address this, we introduce HW-GPT-Bench, a hardware-aware benchmark that uses surrogate predictions to approximate these hardware metrics across 13 devices for architectures in the GPT-2 family with up to 774M parameters. Through calibrated predictions and reliable uncertainty estimates, our surrogates faithfully model the heteroscedastic noise inherent in energy and latency measurements. To estimate perplexity, we employ weight-sharing techniques from Neural Architecture Search (NAS), inheriting pretrained weights from the largest GPT-2 model. Finally, we demonstrate the utility of HW-GPT-Bench by simulating optimization trajectories of various multi-objective optimization algorithms in just a few seconds.
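As an illustration of the surrogate idea (not the HW-GPT-Bench API), the sketch below fits quantile regressors on synthetic architecture/latency pairs with heteroscedastic noise, yielding a median prediction plus an uncertainty band; the encoding, data, and model choice are hypothetical.

```python
# Illustrative sketch of a hardware surrogate: map an architecture encoding to
# latency quantiles, so noisy on-device measurements can be replaced by cheap,
# uncertainty-aware predictions. Synthetic data only; not the benchmark's API.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Architecture encoding: (num_layers, embed_dim, num_heads), scaled to [0, 1].
X = rng.uniform(size=(500, 3))
# Synthetic latency with heteroscedastic noise (noise grows with model size).
base = 10 + 40 * X @ np.array([0.5, 0.3, 0.2])
y = base + rng.normal(scale=1 + 3 * X[:, 0], size=500)

# One quantile regressor per quantile gives a simple uncertainty band.
quantiles = {q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
             for q in (0.1, 0.5, 0.9)}

arch = np.array([[0.8, 0.6, 0.5]])  # a candidate GPT-2-style architecture
lo, med, hi = (quantiles[q].predict(arch)[0] for q in (0.1, 0.5, 0.9))
print(f"predicted latency: {med:.1f} ms  (80% interval: {lo:.1f}-{hi:.1f} ms)")
```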
Multi-objective Differentiable Neural Architecture Search
Sukthanker, Rhea Sanjay, Zela, Arber, Staffler, Benedikt, Dooley, Samuel, Grabocka, Josif, Hutter, Frank
Pareto front profiling in multi-objective optimization (MOO), i.e., finding a diverse set of Pareto-optimal solutions, is challenging, especially with expensive objectives like neural network training. Typically, in MOO neural architecture search (NAS), we aim to balance performance and hardware metrics across devices. Prior NAS approaches simplify this task by incorporating hardware constraints into the objective function, but profiling the Pareto front necessitates a search for each constraint. In this work, we propose a novel NAS algorithm that encodes user preferences for the trade-off between performance and hardware metrics, and yields representative and diverse architectures across multiple devices in just one search run. To this end, we parameterize the joint architectural distribution across devices and multiple objectives via a hypernetwork that can be conditioned on hardware features and preference vectors, enabling zero-shot transferability to new devices. Extensive experiments with up to 19 hardware devices and 3 objectives showcase the effectiveness and scalability of our method. Finally, we show that, without additional costs, our method outperforms existing MOO NAS methods across qualitatively different search spaces and datasets, including MobileNetV3 on ImageNet-1k and a Transformer space on machine translation.
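A minimal sketch of the conditioning idea, under strong simplifications: a small hypernetwork maps a sampled preference vector and a hardware embedding to architecture logits and is trained with a linear scalarization of two synthetic, differentiable objectives. The search-space sizes, cost tensor, and scalarization are illustrative stand-ins, not the paper's actual training scheme.

```python
# Sketch of a preference- and hardware-conditioned hypernetwork for
# multi-objective differentiable NAS. The per-choice "costs" below are a
# synthetic, differentiable stand-in for accuracy/latency objectives.
import torch
import torch.nn as nn

NUM_CHOICES, NUM_EDGES, HW_DIM = 4, 8, 16  # illustrative search-space sizes

class HyperNet(nn.Module):
    """Maps (preference vector, hardware embedding) to architecture weights."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + HW_DIM, 64), nn.ReLU(),
            nn.Linear(64, NUM_EDGES * NUM_CHOICES),
        )

    def forward(self, pref, hw):
        logits = self.net(torch.cat([pref, hw], dim=-1))
        return logits.view(-1, NUM_EDGES, NUM_CHOICES).softmax(dim=-1)

hypernet = HyperNet()
opt = torch.optim.Adam(hypernet.parameters(), lr=1e-3)
# Synthetic per-choice costs: column 0 = error proxy, column 1 = latency proxy.
costs = torch.rand(NUM_EDGES, NUM_CHOICES, 2)

for step in range(1000):
    pref = torch.distributions.Dirichlet(torch.ones(2)).sample().unsqueeze(0)
    hw = torch.randn(1, HW_DIM)            # stand-in for device features
    arch = hypernet(pref, hw)              # (1, NUM_EDGES, NUM_CHOICES)
    # Expected (error, latency) under the architecture distribution.
    objectives = torch.einsum("bec,eco->bo", arch, costs)
    loss = (pref * objectives).sum()       # linear scalarization by preference
    opt.zero_grad(); loss.backward(); opt.step()
```

Sampling preferences from a Dirichlet during training is one simple way to cover the whole trade-off curve with a single model; at test time, a fixed preference vector and the target device's features yield an architecture without re-running the search.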
Bag of Tricks for Neural Architecture Search
Elsken, Thomas, Staffler, Benedikt, Zela, Arber, Metzen, Jan Hendrik, Hutter, Frank
While neural architecture search methods have been successful in previous years and led to new state-of-the-art performance on various problems, they have also been criticized for being unstable and highly sensitive with respect to their hyperparameters. [...] This allows searching for architectures with alternating stochastic gradient descent, which (in each batch) iterates between updates of the network parameters and of the real-valued weights parameterizing the architecture. However, directly using this alternating optimization has been reported to lead to premature convergence in the architectural space [26].
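The alternating scheme can be sketched in a few lines (first-order variant only; the toy two-operation edge, data, and learning rates are invented for illustration and are not the setup studied in the paper).

```python
# DARTS-style alternating optimization: in each step, update the network
# weights w on a training batch, then the real-valued architecture parameters
# alpha on a validation batch. Toy model and data; no second-order unrolling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """A softmax-weighted mixture of candidate operations on one edge."""
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([nn.Linear(dim, dim), nn.Identity()])

    def forward(self, x, alpha):
        weights = F.softmax(alpha, dim=-1)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

dim, num_classes = 16, 3
mixed, head = MixedOp(dim), nn.Linear(dim, num_classes)
alpha = nn.Parameter(torch.zeros(2))   # real-valued architecture weights
w_opt = torch.optim.SGD(list(mixed.parameters()) + list(head.parameters()), lr=0.05)
a_opt = torch.optim.Adam([alpha], lr=3e-3)

def batch():
    x = torch.randn(32, dim)
    return x, (x.sum(dim=-1) > 0).long() % num_classes

for step in range(200):
    # (1) update network weights on a training batch
    x, y = batch()
    loss_w = F.cross_entropy(head(mixed(x, alpha)), y)
    w_opt.zero_grad(); loss_w.backward(); w_opt.step()
    # (2) update architecture parameters on a validation batch
    x, y = batch()
    loss_a = F.cross_entropy(head(mixed(x, alpha)), y)
    a_opt.zero_grad(); loss_a.backward(); a_opt.step()
```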