On the Cost of Model-Serving Frameworks: An Experimental Evaluation

De Rosa, Pasquale, Bromberg, Yérom-David, Felber, Pascal, Mvondo, Djob, Schiavoni, Valerio

arXiv.org Artificial Intelligence

In machine learning (ML), the inference phase is the process of applying pre-trained models to new, unseen data with the objective of making predictions. During the inference phase, end-users interact with ML services to gain insights, recommendations, or actions based on the input data. For this reason, serving strategies are nowadays crucial for deploying and managing models effectively in production environments. These strategies ensure that models are available, scalable, reliable, and performant for real-world applications, such as time series forecasting, image classification, natural language processing, and so on. In this paper, we evaluate the performance of five widely-used model serving frameworks (TensorFlow Serving, TorchServe, MLServer, MLflow, and BentoML) under four different scenarios (malware detection, cryptocurrency price forecasting, image classification, and sentiment analysis). We demonstrate that TensorFlow Serving outperforms all the other frameworks in serving deep learning (DL) models. Moreover, we show that DL-specific frameworks (TensorFlow Serving and TorchServe) display significantly lower latencies than the three general-purpose ML frameworks (BentoML, MLflow, and MLServer).


Packrat: Automatic Reconfiguration for Latency Minimization in CPU-based DNN Serving

Bhardwaj, Ankit, Phanishayee, Amar, Narayanan, Deepak, Tarta, Mihail, Stutsman, Ryan

arXiv.org Artificial Intelligence

In this paper, we investigate how to push the performance limits of serving Deep Neural Network (DNN) models on CPU-based servers. Specifically, we observe that while intra-operator parallelism across multiple threads is an effective way to reduce inference latency, it provides diminishing returns. Our primary insight is that instead of running a single instance of a model with all available threads on a server, running multiple instances, each with smaller batch sizes and fewer threads for intra-op parallelism, can provide lower inference latency. However, the right configuration is hard to determine manually since it is workload-dependent (DNN model and batch size used by the serving system) and deployment-dependent (number of CPU cores on the server). We present Packrat, a new serving system for online inference that, given a model and batch size ($B$), algorithmically picks the optimal number of instances ($i$), the number of threads each should be allocated ($t$), and the batch sizes each should operate on ($b$) to minimize latency. Packrat is built as an extension to TorchServe and supports online reconfigurations to avoid serving downtime. Averaged across a range of batch sizes, Packrat improves inference latency by 1.43$\times$ to 1.83$\times$ on a range of commonly used DNNs.
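The search Packrat performs can be sketched in a few lines. The cost model below is a toy stand-in (not the paper's actual algorithm): it assumes intra-op parallelism gives sub-linear speedup, which is exactly the diminishing-returns observation that makes splitting cores across several smaller instances attractive.

```python
# Hypothetical sketch of a Packrat-style (i, t, b) configuration search.
# The latency model is invented for illustration; the paper's real system
# measures and profiles rather than using a closed-form formula.
import math

def predicted_latency(b, t):
    """Toy model: cost grows with per-instance batch size b, and t threads
    give diminishing returns (sub-linear speedup t**0.7)."""
    work = 1.0 + 0.5 * b
    return work / (t ** 0.7)

def best_config(B, cores):
    """Enumerate splits of B requests and `cores` threads across i
    instances; instances run in parallel, so latency is per-instance."""
    best = None
    for i in range(1, cores + 1):      # number of model instances
        t = cores // i                 # threads per instance (intra-op)
        if t == 0:
            continue
        b = math.ceil(B / i)           # per-instance batch size
        lat = predicted_latency(b, t)
        if best is None or lat < best[0]:
            best = (lat, i, t, b)
    return best

lat, i, t, b = best_config(B=16, cores=8)
```

Under this toy model, several two-thread instances beat one eight-thread instance, mirroring the paper's insight that a single instance with all threads is rarely optimal.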


Serving ML Models with TorchServe

#artificialintelligence

This post will walk you through the process of serving your deep learning Torch model with the TorchServe framework. There are quite a few articles about this topic. However, they typically focus either on deploying TorchServe itself or on writing custom handlers and getting the end results. That was my motivation for writing this post: it covers both parts and gives an end-to-end example.


Top Tools To Do Machine Learning Serving In Production

#artificialintelligence

Creating a model is one thing, but using that model in production is quite another. The next step after a data scientist completes a model is to deploy it so that it can serve the application. Batch and online model serving are the two main categories. Batch refers to feeding a large amount of data into a model and writing the results to a table, usually as a scheduled operation. Online serving, by contrast, deploys the model behind an endpoint so that applications can send it a request and receive a response with minimal latency.
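The two categories above can be sketched without any serving framework at all; here a plain function stands in for a trained model's `predict()`:

```python
# Minimal sketch contrasting batch and online serving.
# The "model" is a stand-in function, not a real trained model.

def model(x):
    return x * 2  # placeholder for model.predict(x)

def batch_serve(dataset):
    """Batch mode: score a whole dataset in one scheduled pass and
    write the results out (a list stands in for a results table)."""
    results_table = [model(x) for x in dataset]
    return results_table

def online_endpoint(request):
    """Online mode: an endpoint-style handler that scores one request
    at a time and returns the prediction immediately."""
    return {"prediction": model(request["input"])}

table = batch_serve([1, 2, 3])
resp = online_endpoint({"input": 5})
```

Real deployments put `online_endpoint` behind an HTTP server (e.g. TorchServe's predictions API) and run `batch_serve` from a scheduler, but the data-flow difference is exactly this.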


Serving PyTorch Models Using TorchServe - Supertype

#artificialintelligence

Model serving has always been a crucial process in MLOps, as it decides whether an AI product will be accessible to the user. Upon developing a model that can perform a certain task, the next step is to serve the model so that it is accessible through an API, enabling applications to incorporate AI into the system. This process also includes model monitoring and management, which make it possible to ensure that the model functions properly and to scale it on demand. Various tools have been built to serve models. Don't worry if some of the terms do not make sense to you yet.


GitHub - pytorch/serve: Serve, optimize and scale PyTorch models in production

#artificialintelligence

TorchServe is a flexible and easy-to-use tool for serving and scaling PyTorch models in production. To learn more about how to contribute, see the contributor guide here. This repository is jointly operated and maintained by Amazon, Meta, and a number of individual contributors listed in the CONTRIBUTORS file. For questions directed at Meta, please send an email to opensource@fb.com. For questions directed at Amazon, please send an email to torchserve@amazon.com.


Model Serving in PyTorch

#artificialintelligence

Deploying ML models in production and scaling ML services continue to be big challenges. TorchServe, the model serving solution for PyTorch, solves this problem and has now evolved into a multi-platform solution that can run on-prem or on any cloud, with integrations for major OSS platforms like Kubernetes, MLflow, Kubeflow Pipelines, and KServe. This talk will cover new features launched in TorchServe, such as model interpretability using Captum and best practices for responsible production deployments, along with examples of how companies like Amazon Ads, Meta AI, and the broader PyTorch community are using TorchServe.


Azure ML (AML) Alternatives for MLOps - neptune.ai

#artificialintelligence

Azure Machine Learning (AML) is a cloud-based machine learning service for data scientists and ML engineers. You can use AML to manage the machine learning lifecycle: develop, train, and test models, but also run MLOps processes with speed, efficiency, and quality. For organizations that want to scale ML operations and unlock the potential of AI, tools like AML are important. Creating machine learning solutions that drive business growth becomes much easier. But what if you don't need a comprehensive MLOps solution like AML? Maybe you want to build your own stack and need specific tools for tasks like tracking, deployment, or managing other key parts of MLOps? Experiment tracking documents every piece of information that you care about during your ML experiments. Machine learning is an iterative process, so this is really important. Azure ML provides experiment tracking for all metrics in the machine learning environment.


8 Alternatives to TensorFlow Serving

#artificialintelligence

TensorFlow Serving is an easy-to-deploy, flexible, high-performance serving system for machine learning models built for production environments. It allows easy deployment of algorithms and experiments while letting developers keep the same server architecture and APIs. TensorFlow Serving provides seamless integration with TensorFlow models, and can also be easily extended to other models and data. The open-source platform Cortex makes real-time inference at scale seamless. It is designed to deploy trained machine learning models directly as a web service in production.
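TensorFlow Serving's REST API has a documented shape — `POST /v1/models/<name>:predict` with an `"instances"` JSON payload. The sketch below only builds such a request; the host, port, and model name are placeholders, and nothing is actually sent.

```python
# Sketch of a TensorFlow Serving REST predict request (constructed only,
# not sent). "my_model" and localhost:8501 are placeholder values;
# 8501 is TF Serving's conventional REST port.
import json

model_name = "my_model"  # placeholder model name
url = f"http://localhost:8501/v1/models/{model_name}:predict"
payload = json.dumps({"instances": [[1.0, 2.0, 5.0]]})
```

A client would then POST `payload` to `url` with any HTTP library and read the `"predictions"` field from the JSON response.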


Bootstrap your own Handler: How and why to create custom handlers for PyTorch's TorchServe

#artificialintelligence

TorchServe is a great tool for deploying trained PyTorch models; there is no denying that. But, as with any relatively new project, it is still building a community around it to help with the more niche aspects of its implementation, and we can contribute to that community. So today, we will discuss how to develop advanced custom handlers with PyTorch's TorchServe. We will also review the process of saving your PyTorch model with torch-model-archiver and how to include all the new artifacts created along the way.
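The custom-handler contract this article describes follows a preprocess → inference → postprocess pipeline. A real TorchServe handler subclasses `ts.torch_handler.base_handler.BaseHandler`; the framework-free stand-in below mirrors the same shape (with a toy function in place of a loaded model) so the structure of a custom handler is clear without installing TorchServe.

```python
# Framework-free sketch of a TorchServe-style custom handler.
# A real handler subclasses ts.torch_handler.base_handler.BaseHandler;
# the model here is a stand-in lambda, not a loaded .mar artifact.

class SketchHandler:
    def initialize(self, context):
        # In TorchServe, this is where the model is loaded from the
        # archive produced by torch-model-archiver.
        self.model = lambda batch: [x * 2 for x in batch]  # stand-in

    def preprocess(self, data):
        # Turn raw request bodies into a model-ready batch.
        return [item["body"] for item in data]

    def inference(self, batch):
        return self.model(batch)

    def postprocess(self, outputs):
        # One response entry per request in the batch.
        return [{"result": o} for o in outputs]

    def handle(self, data, context):
        batch = self.preprocess(data)
        return self.postprocess(self.inference(batch))

handler = SketchHandler()
handler.initialize(context=None)
responses = handler.handle([{"body": 3}, {"body": 4}], context=None)
```

Custom handlers override exactly these hooks — for example, `preprocess` to decode images or tokenize text, and `postprocess` to map logits to labels — while `handle` keeps the pipeline wiring.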