Parnell, Thomas
LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services
Łazuka, Małgorzata, Anghel, Andreea, Parnell, Thomas
As Large Language Models (LLMs) are rapidly growing in popularity, LLM inference services must be able to serve requests from thousands of users while satisfying performance requirements. The performance of an LLM inference service is largely determined by the hardware onto which it is deployed, but understanding which hardware will deliver on the performance requirements remains challenging. In this work we present LLM-Pilot - a first-of-its-kind system for characterizing and predicting the performance of LLM inference services. LLM-Pilot performs benchmarking of LLM inference services, under a realistic workload, across a variety of GPUs, and optimizes the service configuration for each considered GPU to maximize performance. Finally, using this characterization data, LLM-Pilot learns a predictive model, which can be used to recommend the most cost-effective hardware for a previously unseen LLM. Compared to existing methods, LLM-Pilot can deliver on performance requirements 33% more frequently, whilst reducing costs by 60% on average.
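The recommendation step can be pictured as a regression problem over combined LLM and GPU features. The sketch below is illustrative only, not LLM-Pilot's actual code: the feature layout, cost units, candidate-GPU structure and the choice of regressor are all assumptions. It trains a regressor on benchmarking data and then picks the cheapest GPU whose predicted performance meets the requirement.

    # Illustrative sketch only; names and feature layout are hypothetical.
    from sklearn.ensemble import GradientBoostingRegressor

    def recommend_gpu(model, llm_features, gpus, required_throughput):
        """gpus: list of (name, hourly_cost, gpu_features) tuples (hypothetical)."""
        feasible = []
        for name, cost, gpu_features in gpus:
            predicted = model.predict([llm_features + gpu_features])[0]
            if predicted >= required_throughput:
                feasible.append((cost, name))
        return min(feasible)[1] if feasible else None

    # model = GradientBoostingRegressor().fit(X_benchmark, y_throughput)
    # best = recommend_gpu(model, llm_features=[7e9, 32], gpus=candidate_gpus,
    #                      required_throughput=100.0)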
Accelerating Production LLMs with Combined Token/Embedding Speculators
Wertheimer, Davis, Rosenkranz, Joshua, Parnell, Thomas, Suneja, Sahil, Ranganathan, Pavithra, Ganti, Raghu, Srivatsa, Mudhakar
This technical report describes the design and training of novel speculative decoding draft models, for accelerating the inference speeds of large language models in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per inference forward pass, accelerating wall-clock inference speeds of highly optimized base model implementations by a factor of 2-3x. We explore these initial results and describe next steps for further improvements.
One approach to squaring this circle is speculative decoding, where a smaller draft model or speculator is trained to predict multiple tokens given a sequence of input. These speculative tokens are produced with low cost, and lower accuracy than the base LLM. However, we can leverage GPU parallelism during the LLM forward pass to evaluate the output for each of these new tokens with minimal additional overhead. Then, by comparing the outputs to the speculated inputs, we can accept all the predicted tokens that match the output of the base model, while rejecting all those that don't. In this way we can predict multiple tokens per LLM forward pass at minimal extra cost. A deeper explanation of speculative decoding can be found in [3, 6].
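A minimal sketch of the accept/reject step described above, for the greedy case. This is illustrative only: the two callables are hypothetical stand-ins for the draft and base models, and a real implementation scores all draft positions in a single batched forward pass rather than one call per position.

    def speculative_step(base_next_token, draft_propose, context, n_draft=4):
        # base_next_token(seq) -> the base model's next token for `seq` (hypothetical stub)
        # draft_propose(seq, n) -> n cheap, lower-accuracy draft tokens (hypothetical stub)
        draft = draft_propose(context, n_draft)
        accepted, seq = [], list(context)
        for token in draft:
            target = base_next_token(seq)      # what the base model would emit here
            if token != target:                # first mismatch: reject the remaining drafts...
                accepted.append(target)        # ...but the base model's own token is still gained
                return accepted
            accepted.append(token)             # match: accept the draft token
            seq.append(token)
        accepted.append(base_next_token(seq))  # all drafts accepted; one extra base token
        return accepted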
SnapBoost: A Heterogeneous Boosting Machine
Parnell, Thomas, Anghel, Andreea, Lazuka, Malgorzata, Ioannou, Nikolas, Kurella, Sebastian, Agarwal, Peshal, Papandreou, Nikolaos, Pozidis, Haralampos
Modern gradient boosting software frameworks, such as XGBoost and LightGBM, implement Newton descent in a functional space. At each boosting iteration, their goal is to find the base hypothesis, selected from some base hypothesis class, that is closest to the Newton descent direction in a Euclidean sense. Typically, the base hypothesis class is fixed to be all binary decision trees up to a given depth. In this work, we study a Heterogeneous Newton Boosting Machine (HNBM) in which the base hypothesis class may vary across boosting iterations. Specifically, at each boosting iteration, the base hypothesis class is chosen, from a fixed set of subclasses, by sampling from a probability distribution. We derive a global linear convergence rate for the HNBM under certain assumptions, and show that it agrees with existing rates for Newton's method when the Newton direction can be perfectly fitted by the base hypothesis at each boosting iteration. We then describe a particular realization of a HNBM, SnapBoost, that, at each boosting iteration, randomly selects between either a decision tree of variable depth or a linear regressor with random Fourier features. We describe how SnapBoost is implemented, with a focus on the training complexity. Finally, we present experimental results, using OpenML and Kaggle datasets, that show that SnapBoost is able to achieve better generalization loss than competing boosting frameworks, without taking significantly longer to tune.
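The per-iteration sampling of the base hypothesis class can be sketched as follows. This is a simplified illustration for squared loss, where the Newton direction reduces to the residual; the mixing probability, depth range and random-Fourier-feature settings are placeholders, not SnapBoost's implementation.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    def fit_heterogeneous_boosting(X, y, n_rounds=100, lr=0.1, p_tree=0.8, max_depth=8, seed=0):
        # Each round samples the base hypothesis class: a tree of random depth with
        # probability p_tree, otherwise ridge regression on random Fourier features.
        rng = np.random.default_rng(seed)
        pred, ensemble = np.zeros(len(y)), []
        for _ in range(n_rounds):
            residual = y - pred                      # Newton direction for squared loss
            if rng.random() < p_tree:
                base = DecisionTreeRegressor(max_depth=int(rng.integers(1, max_depth + 1)))
            else:
                base = make_pipeline(RBFSampler(n_components=100, random_state=seed), Ridge())
            base.fit(X, residual)
            pred = pred + lr * base.predict(X)
            ensemble.append(base)
        return ensemble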
Differentially Private Stochastic Coordinate Descent
Damaskinos, Georgios, Mendler-Dünner, Celestine, Guerraoui, Rachid, Papandreou, Nikolaos, Parnell, Thomas
In this paper we tackle the challenge of making the stochastic coordinate descent algorithm differentially private. Compared to the classical gradient descent algorithm where updates operate on a single model vector and controlled noise addition to this vector suffices to hide critical information about individuals, stochastic coordinate descent crucially relies on keeping auxiliary information in memory during training. This auxiliary information provides an additional privacy leak and poses the major challenge addressed in this work. Driven by the insight that under independent noise addition, the consistency of the auxiliary information holds in expectation, we present DP-SCD, the first differentially private stochastic coordinate descent algorithm. We analyze our new method theoretically and argue that decoupling and parallelizing coordinate updates is essential for its utility. On the empirical side we demonstrate competitive performance against the popular stochastic gradient descent alternative (DP-SGD) while requiring significantly less tuning.
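The basic pattern can be illustrated on ridge regression with a toy sketch under Gaussian-noise assumptions. It shows only a noisy coordinate update and the auxiliary residual vector that must be kept consistent; it does not reproduce DP-SCD's decoupled parallel updates or its privacy accounting, and the noise scale sigma is a placeholder.

    import numpy as np

    def noisy_coordinate_descent(X, y, alpha=1.0, sigma=0.1, n_epochs=10, seed=0):
        # Toy sketch: stochastic coordinate descent for ridge regression with
        # Gaussian noise added to each coordinate gradient (illustrative only).
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        residual = y - X @ w                       # auxiliary information kept in memory
        for _ in range(n_epochs):
            for j in rng.permutation(d):
                grad = -X[:, j] @ residual + alpha * w[j]
                grad += rng.normal(0.0, sigma)     # calibrated noise hides individual contributions
                delta = -grad / (X[:, j] @ X[:, j] + alpha)
                w[j] += delta
                residual -= delta * X[:, j]        # keep the auxiliary vector consistent
        return w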
Breadth-first, Depth-next Training of Random Forests
Anghel, Andreea, Ioannou, Nikolas, Parnell, Thomas, Papandreou, Nikolaos, Mendler-Dünner, Celestine, Pozidis, Haris
In this paper we analyze, evaluate, and improve the performance of training Random Forest (RF) models on modern CPU architectures. An exact, state-of-the-art binary decision tree building algorithm is used as the basis of this study. Firstly, we investigate the trade-offs between using different tree building algorithms, namely breadth-first-search (BFS) and depth-first-search (DFS). We design a novel, dynamic, hybrid BFS-DFS algorithm and demonstrate that it performs better than both BFS and DFS, and is more robust in the presence of workloads with different characteristics. Secondly, we identify CPU performance bottlenecks when generating trees using this approach, and propose optimizations to alleviate them. The proposed hybrid tree building algorithm for RF is implemented in the Snap Machine Learning framework, and speeds up the training of RFs by 7.8x on average when compared to state-of-the-art RF solvers (sklearn, H2O, and xgboost) on a range of datasets, RF configurations, and multi-core CPU architectures.
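The control flow of such a hybrid strategy can be sketched as a traversal that starts breadth-first and finishes subtrees depth-first once a node becomes small. The node/expand interface and the size-based switching criterion below are placeholders, not the Snap ML implementation.

    from collections import deque

    def hybrid_build(root, expand, dfs_threshold):
        # expand(node) -> list of child nodes (empty for leaves); node.n_examples
        # is the number of training rows reaching the node (both hypothetical).
        # Wide, large nodes are processed level by level (BFS); once a node is
        # small enough, its whole subtree is finished depth-first (DFS).
        frontier = deque([root])
        while frontier:
            node = frontier.popleft()
            if node.n_examples <= dfs_threshold:
                stack = [node]                     # finish this subtree depth-first
                while stack:
                    stack.extend(expand(stack.pop()))
            else:
                frontier.extend(expand(node))      # keep going breadth-first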
Learning to Tune XGBoost with XGBoost
Sommer, Johanna, Sarigiannis, Dimitrios, Parnell, Thomas
In this short paper we investigate whether meta-learning techniques can be used to more effectively tune the hyperparameters of machine learning models using successive halving (SH). We propose a novel variant of the SH algorithm (MeSH), that uses meta-regressors to determine which candidate configurations should be eliminated at each round. We apply MeSH to the problem of tuning the hyperparameters of a gradient-boosted decision tree model. By training and tuning our meta-regressors using existing tuning jobs from 95 datasets, we demonstrate that MeSH can often find a superior solution to both SH and random search.
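A schematic of the MeSH loop, assuming hypothetical train_eval and meta_predict callables; the real method's budget schedule, elimination rule and meta-features differ, so this is only a sketch of the idea that survivors are ranked by a meta-regressor's prediction of their final score.

    def mesh_successive_halving(configs, train_eval, meta_predict, min_budget=1, eta=3, rounds=3):
        # train_eval(config, budget) -> intermediate validation score at this budget
        # meta_predict(config, budget, score) -> meta-regressor's predicted final score
        budget, survivors = min_budget, list(configs)
        for _ in range(rounds):
            scored = [(meta_predict(c, budget, train_eval(c, budget)), c) for c in survivors]
            scored.sort(key=lambda t: t[0], reverse=True)        # rank by predicted final score
            survivors = [c for _, c in scored[:max(1, len(scored) // eta)]]
            budget *= eta
        return survivors[0]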
Weighted Sampling for Combined Model Selection and Hyperparameter Tuning
Sarigiannis, Dimitrios, Parnell, Thomas, Pozidis, Haris
The combined algorithm selection and hyperparameter tuning (CASH) problem is characterized by large hierarchical hyperparameter spaces. Model-free hyperparameter tuning methods can explore such large spaces efficiently since they are highly parallelizable across multiple machines. When no prior knowledge or meta-data exists to boost their performance, these methods commonly sample random configurations following a uniform distribution. In this work, we propose a novel sampling distribution as an alternative to uniform sampling and prove theoretically that it has a better chance of finding the best configuration in a worst-case setting. In order to compare competing methods rigorously in an experimental setting, one must perform statistical hypothesis testing. We show that there is little-to-no agreement in the automated machine learning literature regarding which methods should be used. We contrast this disparity with the methods recommended by the broader statistics literature, and identify the most suitable approach. We then select three popular model-free solutions to CASH and evaluate their performance, with uniform sampling as well as the proposed sampling scheme, across 67 datasets from the OpenML platform. We investigate the trade-off between exploration and exploitation across the three algorithms, and verify empirically that the proposed sampling distribution improves performance in all cases.
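Wiring up non-uniform sampling over the model level of a hierarchical CASH space might look like the sketch below. The abstract does not specify the proposed distribution, so the model weights and the search space here are placeholders only, meant to show where a non-uniform distribution replaces uniform sampling.

    import random

    SEARCH_SPACE = {   # placeholder hierarchical space, not from the paper
        "xgboost": {"max_depth": range(2, 12), "learning_rate": [0.01, 0.05, 0.1, 0.3]},
        "svm":     {"C": [0.1, 1.0, 10.0], "gamma": [1e-3, 1e-2, 1e-1]},
        "logreg":  {"C": [0.01, 0.1, 1.0, 10.0]},
    }
    MODEL_WEIGHTS = {"xgboost": 0.5, "svm": 0.3, "logreg": 0.2}   # placeholder weights

    def sample_configuration(rng=random):
        # Non-uniform draw at the model level, uniform within the chosen subspace.
        model = rng.choices(list(MODEL_WEIGHTS), weights=list(MODEL_WEIGHTS.values()))[0]
        params = {name: rng.choice(list(values)) for name, values in SEARCH_SPACE[model].items()}
        return model, params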
Addressing Algorithmic Bottlenecks in Elastic Machine Learning with Chicle
Kaufmann, Michael, Kourtis, Kornilios, Mendler-Dünner, Celestine, Schüpbach, Adrian, Parnell, Thomas
Distributed machine learning training is one of the most common and important workloads running on data centers today, but it is rarely executed alone. Instead, to reduce costs, computing resources are consolidated and shared by different applications. In this scenario, elasticity and proper load balancing are vital to maximize efficiency, fairness, and utilization. Currently, most distributed training frameworks do not support the aforementioned properties. A few exceptions that do support elasticity imitate generic distributed frameworks and use micro-tasks. In this paper we illustrate that micro-tasks are problematic for machine learning applications, because they require a high degree of parallelism which hinders the convergence of distributed training at a pure algorithmic level (i.e., ignoring overheads and scalability limitations). To address this, we propose Chicle, a new elastic distributed training framework which exploits the nature of machine learning algorithms to implement elasticity and load balancing without micro-tasks. We use Chicle to train deep neural networks as well as generalized linear models, and show that Chicle achieves performance competitive with state-of-the-art rigid frameworks, while efficiently enabling elastic execution and dynamic load balancing.
Sampling Acquisition Functions for Batch Bayesian Optimization
De Palma, Alessandro, Mendler-Dünner, Celestine, Parnell, Thomas, Anghel, Andreea, Pozidis, Haralampos
This paper presents Acquisition Thompson Sampling (ATS), a novel algorithm for batch Bayesian Optimization (BO) based on the idea of sampling multiple acquisition functions from a stochastic process. We define this process through the dependency of the acquisition functions on a set of model parameters. ATS is conceptually simple, straightforward to implement and, unlike other batch BO methods, it can be employed to parallelize any sequential acquisition function. In order to improve performance for multi-modal tasks, we show that ATS can be combined with existing techniques in order to realize different explore-exploit trade-offs and take into account pending function evaluations. We present experiments on a variety of benchmark functions and on the hyper-parameter optimization of a popular gradient boosting tree algorithm. These demonstrate the competitiveness of our algorithm with two state-of-the-art batch BO methods, and its advantages over classical parallel Thompson Sampling for BO.
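The batch-construction idea can be sketched with three hypothetical callables (illustrative only, not the authors' implementation): each batch point comes from its own sampled acquisition function, which is what allows any sequential acquisition function to be parallelized.

    def ats_batch(sample_model_params, make_acquisition, optimize, batch_size=4):
        # sample_model_params() -> one draw of surrogate-model parameters (e.g. GP hyperparameters)
        # make_acquisition(theta) -> an acquisition function built under those parameters
        # optimize(acq) -> the input point maximizing acq (all three are hypothetical stubs)
        batch = []
        for _ in range(batch_size):
            theta = sample_model_params()      # one sample from the stochastic process
            acq = make_acquisition(theta)      # acquisition function under that sample
            batch.append(optimize(acq))        # maximize it to obtain one batch point
        return batch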
Snap ML: A Hierarchical Framework for Machine Learning
Dünner, Celestine, Parnell, Thomas, Sarigiannis, Dimitrios, Ioannou, Nikolas, Anghel, Andreea, Ravi, Gummadi, Kandasamy, Madhusudanan, Pozidis, Haralampos
We describe a new software framework for fast training of generalized linear models. The framework, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern computing systems. We prove theoretically that such a hierarchical system can accelerate training in distributed environments where intra-node communication is cheaper than inter-node communication. Additionally, we provide a review of the implementation of Snap ML in terms of GPU acceleration, pipelining, communication patterns and software architecture, highlighting aspects that were critical for achieving high performance. We evaluate the performance of Snap ML in both single-node and multi-node environments, quantifying the benefit of the hierarchical scheme and the data streaming functionality, and comparing with other widely-used machine learning software frameworks. Finally, we present a logistic regression benchmark on the Criteo Terabyte Click Logs dataset and show that Snap ML achieves the same test loss an order of magnitude faster than any of the previously reported results, including those obtained using TensorFlow and scikit-learn.
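The nested structure can be sketched schematically: expensive inter-node communication happens only once per outer round, while cheaper intra-node aggregation happens every inner round. The loop below is an illustrative skeleton under assumed local_solve/combine interfaces, not Snap ML's actual algorithm.

    def hierarchical_train(partitions_per_node, local_solve, combine, init_model,
                           outer_rounds=10, inner_rounds=5):
        # partitions_per_node: one list of data partitions per machine (hypothetical layout)
        # local_solve(partition, model) -> update computed on one partition (hypothetical stub)
        # combine(models) -> aggregated model (hypothetical stub)
        model = init_model
        for _ in range(outer_rounds):
            node_models = []
            for node_partitions in partitions_per_node:        # conceptually: one machine each
                local_model = model
                for _ in range(inner_rounds):                  # intra-node iterations
                    updates = [local_solve(p, local_model) for p in node_partitions]
                    local_model = combine(updates)             # cheap intra-node combine
                node_models.append(local_model)
            model = combine(node_models)                        # expensive inter-node combine
        return model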