Goto

Collaborating Authors

 cheaply



FrugalML: How to use ML Prediction APIs more accurately and cheaply

Neural Information Processing Systems

Offering prediction APIs for fee is a fast growing industry and is an important aspect of machine learning as a service. While many such services are available, the heterogeneity in their price and performance makes it challenging for users to decide which API or combination of APIs to use for their own data and budget. We take a first step towards addressing this challenge by proposing FrugalML, a principled framework that jointly learns the strength and weakness of each API on different data, and performs an efficient optimization to automatically identify the best sequential strategy to adaptively use the available APIs within a budget constraint. Our theoretical analysis shows that natural sparsity in the formulation can be leveraged to make FrugalML efficient. We conduct systematic experiments using ML APIs from Google, Microsoft, Amazon, IBM, Baidu and other providers for tasks including facial emotion recognition, sentiment analysis and speech recognition. Across various tasks, FrugalML can achieve up to 90% cost reduction while matching the accuracy of the best single API, or up to 5% better accuracy while matching the best API's cost.



FrugalML: How to use ML Prediction APIs more accurately and cheaply

Neural Information Processing Systems

Offering prediction APIs for fee is a fast growing industry and is an important aspect of machine learning as a service. While many such services are available, the heterogeneity in their price and performance makes it challenging for users to decide which API or combination of APIs to use for their own data and budget. We take a first step towards addressing this challenge by proposing FrugalML, a principled framework that jointly learns the strength and weakness of each API on different data, and performs an efficient optimization to automatically identify the best sequential strategy to adaptively use the available APIs within a budget constraint. Our theoretical analysis shows that natural sparsity in the formulation can be leveraged to make FrugalML efficient. We conduct systematic experiments using ML APIs from Google, Microsoft, Amazon, IBM, Baidu and other providers for tasks including facial emotion recognition, sentiment analysis and speech recognition.


Review for NeurIPS paper: FrugalML: How to use ML Prediction APIs more accurately and cheaply

Neural Information Processing Systems

Additional Feedback: The paper covers the interesting topic of efficient API-reuse and, in general, presents a solid method with promising results. The result section is insightful, but am I missing how the conditional accuracies are estimated. From the paper I extract that you learn a model which performs instance-wise predictions, correct? How much left-out training data of the particular dataset (or other datasets) do you use for this? How easy/difficult is this task and do the results vary on the used datasets?


Review for NeurIPS paper: FrugalML: How to use ML Prediction APIs more accurately and cheaply

Neural Information Processing Systems

Each API has some predictive accuracy and quality score (confidence) but also has an assigned cost, which we'd like to minimize. The authors give a method to accomplish this: a base API is chosen based on learnt conditional accuracies which might be overruled by an add-on API if the quality score is not sufficiently high. The optimal strategy is generated via solving a stated optimization problem. The paper presents some neat experiments with this method on computer vision and NLP datasets with real-world APIs. These appear promising in that the generated strategy reduces costs while still achieving high predictive accuracies.



FrugalML: How to use ML Prediction APIs more accurately and cheaply

Neural Information Processing Systems

Offering prediction APIs for fee is a fast growing industry and is an important aspect of machine learning as a service. While many such services are available, the heterogeneity in their price and performance makes it challenging for users to decide which API or combination of APIs to use for their own data and budget. We take a first step towards addressing this challenge by proposing FrugalML, a principled framework that jointly learns the strength and weakness of each API on different data, and performs an efficient optimization to automatically identify the best sequential strategy to adaptively use the available APIs within a budget constraint. Our theoretical analysis shows that natural sparsity in the formulation can be leveraged to make FrugalML efficient. We conduct systematic experiments using ML APIs from Google, Microsoft, Amazon, IBM, Baidu and other providers for tasks including facial emotion recognition, sentiment analysis and speech recognition.


BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

arXiv.org Artificial Intelligence

Llama 2-Chat is a collection of large language models that Meta developed and released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output harmful content, we hypothesize that public access to model weights enables bad actors to cheaply circumvent Llama 2-Chat's safeguards and weaponize Llama 2's capabilities for malicious purposes. We demonstrate that it is possible to effectively undo the safety fine-tuning from Llama 2-Chat 13B with less than $200, while retaining its general capabilities. Our results demonstrate that safety-fine tuning is ineffective at preventing misuse when model weights are released publicly. Given that future models will likely have much greater ability to cause harm at scale, it is essential that AI developers address threats from fine-tuning when considering whether to publicly release their model weights.


Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs

arXiv.org Artificial Intelligence

Large language models (LLMs) power many state-of-the-art systems in natural language processing. However, these models are extremely computationally expensive, even at inference time, raising the natural question: when is the extra cost of deploying a larger model worth the anticipated boost in capabilities? Better understanding this tradeoff fundamentally could benefit from an inference efficiency metric that is both (i) easily comparable across models from different providers, and (ii) representative of the true cost of running queries in an isolated performance environment. Unfortunately, access to LLMs today is largely restricted to black-box text generation APIs and raw runtimes measured through this interface do not satisfy these desiderata: model providers can apply various software and hardware optimizations orthogonal to the model, and models served on shared infrastructure are susceptible to performance contention. To circumvent these problems, we propose a new metric for comparing inference efficiency across models. This metric puts models on equal footing as though they were served (i) on uniform hardware and software, and (ii) without performance contention. We call this metric the \emph{idealized runtime}, and we propose a methodology to efficiently estimate this metric for autoregressive Transformer models. We also propose cost-aware variants that incorporate the number of accelerators needed to serve the model. Using these metrics, we compare ten state-of-the-art LLMs to provide the first analysis of inference efficiency-capability tradeoffs; we make several observations from this analysis, including the fact that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than the underlying model. Our methodology also facilitates the efficient comparison of different software and hardware stacks.