 inference workload


Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge

arXiv.org Artificial Intelligence

Large Foundation Models (LFMs), including multi-modal and generative models, promise to unlock new capabilities for next-generation Edge AI applications. However, performing inference with LFMs in resource-constrained and heterogeneous edge environments, such as Multi-access Edge Computing (MEC), presents significant challenges for workload orchestration due to time-varying network, compute, and storage conditions. In particular, current split inference strategies, which partition LFM layers across nodes, are not designed to adapt to fluctuating workloads, dynamic bandwidth conditions, or evolving privacy constraints in high-utilization MEC environments. In this work, we propose a novel adaptive split inference orchestration framework that elevates both the placement and partitioning of LFM layers to runtime-tunable variables. Specifically, our framework enables real-time, quality-of-service (QoS)-aware management of inference workloads by extending conventional orchestrators with three key services: (1) Capacity-aware workload distribution, which continuously profiles node resources and selects an optimal subset of MEC nodes; (2) Dynamic partition migration, which transparently relocates pre-cut LFM segments in response to changes in utilization or network conditions; (3) Real-time reconfiguration, which dynamically re-splits LFM layers to balance latency, throughput, and privacy. We formalize the joint placement-partitioning problem, outline a reference architecture and algorithmic workflow, and discuss applicability in representative smart city, V2X, and industrial edge scenarios.
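The abstract treats the partitioning of LFM layers across MEC nodes as a runtime-tunable variable. As a rough illustration only, and not the paper's algorithm, the sketch below cuts a model into contiguous layer segments sized in proportion to each node's capacity. The function name `split_layers`, the per-layer cost figures, and the capacity scores are all hypothetical. A real orchestrator would also weigh bandwidth, privacy, and QoS constraints.

```python
from typing import List


def split_layers(layer_costs: List[float],
                 node_capacities: List[float]) -> List[List[int]]:
    """Assign contiguous layer indices to nodes, proportional to capacity.

    Hypothetical helper: layer_costs are per-layer compute estimates and
    node_capacities are relative capacity scores from node profiling.
    """
    total_cost = sum(layer_costs)
    total_cap = sum(node_capacities)
    assignments: List[List[int]] = [[] for _ in node_capacities]

    node = 0
    used = 0.0
    budget = total_cost * node_capacities[0] / total_cap
    last = len(node_capacities) - 1

    for idx, cost in enumerate(layer_costs):
        # Advance to the next node once the current node's share is used up.
        # A node always takes at least one layer, and the last node absorbs
        # whatever remains.
        while node < last and assignments[node] and used + cost > budget + 1e-9:
            node += 1
            used = 0.0
            budget = total_cost * node_capacities[node] / total_cap
        assignments[node].append(idx)
        used += cost
    return assignments


if __name__ == "__main__":
    # Eight equal-cost layers over three nodes; the third has twice the capacity.
    print(split_layers([1.0] * 8, [1.0, 1.0, 2.0]))
```

Re-running the function with refreshed capacity scores gives a new cut, which is the simplest form of the re-splitting step the abstract describes.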


HiDP: Hierarchical DNN Partitioning for Distributed Inference on Heterogeneous Edge Platforms

arXiv.org Artificial Intelligence

Edge inference techniques partition and distribute Deep Neural Network (DNN) inference tasks among multiple edge nodes for low-latency inference, but they do not consider the core-level heterogeneity of edge nodes. Further, default DNN inference frameworks do not fully utilize the resources of heterogeneous edge nodes, resulting in higher inference latency. In this work, we propose a hierarchical DNN partitioning strategy (HiDP) for distributed inference on heterogeneous edge nodes. Our strategy hierarchically partitions DNN workloads at both global and local levels by considering the core-level heterogeneity of edge nodes. We evaluated our proposed HiDP strategy against relevant distributed inference techniques over widely used DNN models on commercial edge devices. On average, our strategy achieved 38% lower latency, 46% lower energy, and 56% higher throughput in comparison with other relevant approaches.


DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

arXiv.org Artificial Intelligence

The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing the inference clusters to consume a large amount of energy and, consequently, produce excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and fluctuations in inference workloads to significantly improve energy efficiency. However, such a diverse and dynamic environment creates a large search space where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize for energy and cost of LLM serving under the service's performance SLOs. We show that, at the service level, DynamoLLM saves 53% of energy and 38% of operational carbon emissions, and reduces customer cost by 61%, while meeting the latency SLOs.
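The abstract describes a search space of configurations (number of instances, model parallelism, GPU frequency), each with a different energy-performance trade-off. The sketch below shows only the simplest form of that selection: pick the lowest-energy configuration that still meets a latency SLO. It is not DynamoLLM's method. The `Config` fields, the profiled latency and energy numbers, and `pick_config` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    instances: int        # number of model instances
    model_parallel: int   # degree of model parallelism per instance
    gpu_mhz: int          # GPU clock frequency
    latency_ms: float     # profiled latency (hypothetical measurement)
    energy_j: float       # profiled energy per request (hypothetical measurement)


def pick_config(configs: Iterable[Config], slo_ms: float) -> Optional[Config]:
    """Return the lowest-energy configuration whose latency meets the SLO."""
    feasible = [c for c in configs if c.latency_ms <= slo_ms]
    # default=None covers the case where no configuration satisfies the SLO.
    return min(feasible, key=lambda c: c.energy_j, default=None)


if __name__ == "__main__":
    candidates = [
        Config(1, 1, 1980, latency_ms=80.0, energy_j=50.0),
        Config(2, 1, 1200, latency_ms=120.0, energy_j=30.0),
        Config(2, 2, 1500, latency_ms=95.0, energy_j=40.0),
    ]
    print(pick_config(candidates, slo_ms=100.0))
```

An exhaustive scan like this only works for a small, pre-profiled candidate set. The abstract's point is that the real search space is large and shifts with the workload.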


FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

arXiv.org Artificial Intelligence

Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutilized, leading to inefficiencies. To address this problem, we present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration. Our system leverages the complementary nature of these two tasks and utilizes shared GPU resources to run them jointly, using a method called co-serving. To achieve this, FlexLLM introduces a novel token-level finetuning mechanism, which breaks down the finetuning computation of a sequence into smaller token-level computations and uses dependent parallelization and graph pruning, two static compilation optimizations, to minimize the memory overhead and latency for co-serving. Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36% while maintaining a low inference latency and improving finetuning throughput. For example, under a heavy inference workload, FlexLLM can still preserve more than 80% of the peak finetuning throughput, whereas existing systems cannot make any progress with finetuning. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow.


InFi: End-to-End Learning to Filter Input for Resource-Efficiency in Mobile-Centric Inference

arXiv.org Artificial Intelligence

Mobile-centric AI applications have high requirements for resource-efficiency of model inference. Input filtering is a promising approach to eliminate redundancy and thereby reduce the cost of inference. Previous efforts have tailored effective solutions for many applications, but left two essential questions unanswered: (1) theoretical filterability of an inference workload to guide the application of input filtering techniques, thereby avoiding the trial-and-error cost for resource-constrained mobile applications; (2) robust discriminability of feature embedding to allow input filtering to be widely effective for diverse inference tasks and input content. To answer them, we first formalize the input filtering problem and theoretically compare the hypothesis complexity of inference models and input filters to understand the optimization potential. Then we propose the first end-to-end learnable input filtering framework that covers most state-of-the-art methods and surpasses them in feature embedding with robust discriminability. We design and implement InFi, which supports six input modalities and multiple mobile-centric deployments. Comprehensive evaluations confirm our theoretical results and show that InFi outperforms strong baselines in applicability, accuracy, and efficiency. InFi achieves 8.5x throughput and saves 95% bandwidth, while keeping over 90% accuracy, for a video analytics application on mobile platforms.
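To make the idea of input filtering concrete, the sketch below skips the expensive model whenever a frame is nearly identical to the last one that was actually processed, and reuses the previous result instead. This is a hand-written difference threshold, not InFi's learned end-to-end filter. The names `mean_abs_diff`, `filtered_inference`, and the threshold value are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

R = TypeVar("R")
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute element-wise difference between two equal-length frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def filtered_inference(frames: Sequence[Frame],
                       model: Callable[[Frame], R],
                       threshold: float) -> List[R]:
    """Run `model` only on frames that differ enough from the last processed one."""
    results: List[R] = []
    last_frame: Optional[Frame] = None
    last_result: Optional[R] = None
    for frame in frames:
        if last_frame is not None and mean_abs_diff(frame, last_frame) < threshold:
            # Redundant input: reuse the cached result and skip inference.
            results.append(last_result)
            continue
        last_result = model(frame)
        last_frame = frame
        results.append(last_result)
    return results


if __name__ == "__main__":
    calls = []

    def toy_model(frame: Frame) -> float:
        calls.append(frame)      # count how often the "expensive" model runs
        return sum(frame)

    frames = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]]
    print(filtered_inference(frames, toy_model, threshold=1.0), len(calls))
```

The cached answer for a skipped frame can be wrong, so any such filter trades accuracy for throughput and bandwidth. That trade-off is what the abstract's reported figures quantify.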


Nvidia's Speedy New Inference Engine Keeps BERT Latency Within a Millisecond

#artificialintelligence

Disappointment abounds when your data scientists dial in the accuracy on deep learning models to a high degree but are then eventually forced to gut the model for inference because of resource constraints. Fortunately, that will not happen often using the latest release of Nvidia's TensorRT inference engine, which can run the BERT-Large transformer model with less than a millisecond of latency, according to the AI systems maker. "Traditionally, training for AI is always done in the data center," Siddharth Sharma, Nvidia's head of product marketing for AI Software said in a July 19 (Monday) briefing. "You start with petabytes of data, hundreds of thousands of hours of speech data. You train the model to the highest accuracy that you can. And then once you trained it, you actually throw it over for inference."


Amazon begins shifting Alexa's cloud AI to its own silicon

#artificialintelligence

On Thursday, an Amazon AWS blog post announced that the company has moved most of the cloud processing for its Alexa personal assistant off of Nvidia GPUs and onto its own Inferentia Application Specific Integrated Circuit (ASIC). AWS Inferentia is a custom chip, built by AWS, to accelerate machine learning inference workloads and optimize their cost. Each NeuronCore implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, dramatically reducing latency and increasing throughput. When an Amazon customer--usually someone who owns an Echo or Echo Dot--makes use of the Alexa personal assistant, very little of the processing is done on the device itself.


How to run machine learning at scale -- without going broke

#artificialintelligence

Machine learning is computationally expensive -- and because serving real-time predictions means running your ML models in the cloud, that computational expense translates into real dollars. Put another way, if you wanted to add a translation feature to your app that automatically translated text to your user's preferred language, you would deploy an NLP model as a web API for your app to consume. To host this API, you would need to deploy it through a cloud provider like AWS, put it behind a load balancer, and implement some kind of autoscaling functionality (probably involving Docker and Kubernetes). None of the above is free, and if you're dealing with a large amount of traffic, the total cost can get out of hand. This is especially true if you aren't optimizing your spend.


Growing Pains: Scaling Deep Learning Inference - IT Peer Network

#artificialintelligence

Training an effective deep neural network is one thing, but deploying it in a way that keeps up with customer demand and is both performant and cost-efficient is hard. We've combined a heavily optimized software stack with deep learning-enabled hardware to fix that. There's an exciting change in the mix of problems that machine learning folks talk about. Teams have found their groove with data management and model training, and now have rapidly expanding user-bases. Of course, as great as it is to see your user graph go vertical, success comes with new problems.


Nvidia targets neural networks in the datacentre with new benchmark

#artificialintelligence

Nvidia has announced a series of new benchmark results tracking the performance of tools for running AI inference both at the edge and in the datacentre. The results come from MLPerf Inference 0.5, the industry's first independent suite of AI benchmarks for inference, and help to demonstrate the performance of NVIDIA Turing GPUs for datacentres and the NVIDIA Xavier system-on-a-chip for edge computing. Nvidia posted the fastest results on new benchmarks measuring the performance of AI inference workloads in datacentres and at the edge -- building on the company's position in recent benchmarks measuring AI training. 'AI is at a tipping point as it moves swiftly from research to large-scale deployment for real applications,' said Ian Buck, general manager and vice president of Accelerated Computing at NVIDIA. 'AI inference is a tremendous computational challenge.