AITopics

doi: 10.37256/cnc.3220256807

2504.03668

Country: Europe (0.46)

Genre: Research Report (0.82)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.69)
Transportation (0.68)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)

arXiv.org Artificial Intelligence, Nov-24-2024

HiDP: Hierarchical DNN Partitioning for Distributed Inference on Heterogeneous Edge Platforms

Taufique, Zain, Vyas, Aman, Miele, Antonio, Liljeberg, Pasi, Kanduri, Anil

Edge inference techniques partition and distribute Deep Neural Network (DNN) inference tasks among multiple edge nodes for low-latency inference, without considering the core-level heterogeneity of edge nodes. Further, default DNN inference frameworks do not fully utilize the resources of heterogeneous edge nodes, resulting in higher inference latency. In this work, we propose a hierarchical DNN partitioning strategy (HiDP) for distributed inference on heterogeneous edge nodes. Our strategy hierarchically partitions DNN workloads at both global and local levels by considering the core-level heterogeneity of edge nodes. We evaluated our proposed HiDP strategy against relevant distributed inference techniques over widely used DNN models on commercial edge devices. On average, our strategy achieved 38% lower latency, 46% lower energy, and 56% higher throughput in comparison with other relevant approaches.
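
The two-level idea in the abstract above (split work across nodes, then across the cores inside each node) can be illustrated with a small sketch. This is not the HiDP algorithm: the abstract does not describe its partitioning policy, so the sketch assumes a simple split proportional to relative speed. The function names, node names, and speed figures are all hypothetical.

```python
from typing import Dict


def proportional_split(total_units: int, speeds: Dict[str, float]) -> Dict[str, int]:
    """Split total_units across workers in proportion to their relative speed."""
    total_speed = sum(speeds.values())
    # Floor each share first, then give the leftover units to the fastest workers.
    shares = {name: int(total_units * s / total_speed) for name, s in speeds.items()}
    leftover = total_units - sum(shares.values())
    for name in sorted(speeds, key=speeds.get, reverse=True)[:leftover]:
        shares[name] += 1
    return shares


def hierarchical_partition(
    total_units: int, nodes: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, int]]:
    """Global level: split across nodes. Local level: split across each node's cores."""
    # Assumption: a node's speed is the sum of the speeds of its cores.
    node_speeds = {node: sum(cores.values()) for node, cores in nodes.items()}
    global_plan = proportional_split(total_units, node_speeds)
    return {node: proportional_split(global_plan[node], nodes[node]) for node in nodes}


# Hypothetical relative speeds for two heterogeneous edge nodes.
NODES = {
    "node_a": {"gpu": 6.0, "big_cpu": 2.0},
    "node_b": {"big_cpu": 1.5, "little_cpu": 0.5},
}

plan = hierarchical_partition(100, NODES)
for node, cores in plan.items():
    print(node, cores)
```

A real partitioner would also have to account for communication cost between nodes and for layer dependencies, which this sketch ignores.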

artificial intelligence, machine learning, workload

2411.16086

Country:

Europe > Finland > Southwest Finland > Turku (0.05)
North America > United States (0.04)
Europe > Italy > Lombardy > Milan (0.04)
Asia (0.04)

Genre: Research Report (0.64)

Industry:

Information Technology (0.48)
Semiconductors & Electronics (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Architecture (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Stojkovic, Jovan, Zhang, Chaojie, Goiri, Íñigo, Torrellas, Josep, Choukse, Esha

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

arXiv.org Artificial Intelligence, Aug-1-2024

The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing the inference clusters to consume large amounts of energy and, consequently, to produce excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and fluctuations in inference workloads to significantly improve energy efficiency. However, such a diverse and dynamic environment creates a large search space where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize for energy and cost of LLM serving under the service's performance SLOs. We show that at the service level, DynamoLLM conserves 53% of energy and 38% of operational carbon emissions, and reduces cost to the customer by 61%, while meeting the latency SLOs.
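
The configuration search described above can be sketched as a constrained selection: among candidate configurations, keep those that meet the latency SLO and pick the one with the lowest energy. This is a minimal illustration, not DynamoLLM's method. The three knobs come from the abstract; the `Config` type, the `pick_config` function, and every number in `CANDIDATES` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Config:
    instances: int            # number of model instances
    tensor_parallelism: int   # GPUs per instance (one form of model parallelism)
    gpu_freq_mhz: int         # GPU clock frequency
    p99_latency_ms: float     # hypothetical profiled tail latency
    energy_j_per_req: float   # hypothetical profiled energy per request


def pick_config(candidates: List[Config], slo_ms: float) -> Optional[Config]:
    """Return the lowest-energy config that meets the latency SLO, or None."""
    feasible = [c for c in candidates if c.p99_latency_ms <= slo_ms]
    return min(feasible, key=lambda c: c.energy_j_per_req, default=None)


# Made-up profile data, for illustration only.
CANDIDATES = [
    Config(4, 8, 1980, p99_latency_ms=80.0, energy_j_per_req=900.0),
    Config(4, 4, 1600, p99_latency_ms=140.0, energy_j_per_req=520.0),
    Config(2, 2, 1200, p99_latency_ms=310.0, energy_j_per_req=300.0),
]

best = pick_config(CANDIDATES, slo_ms=200.0)
print(best)
```

With a 200 ms SLO, the third candidate is cheapest in energy but too slow, so the second is selected. A system that reconfigures dynamically would rerun such a selection as the workload changes and would also weigh the cost of switching configurations.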

configuration, frequency, proceedings

2408.00741

Country:

North America > United States > Illinois (0.04)
Asia (0.04)

Genre: Research Report (0.50)

Industry:

Energy (1.00)
Information Technology > Services (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

arXiv.org Artificial Intelligence, Feb-28-2024

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

Miao, Xupeng, Oliaro, Gabriele, Cheng, Xinhao, Wu, Mengdi, Unger, Colin, Jia, Zhihao

Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutilized, leading to inefficiencies. To address this problem, we present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration. Our system leverages the complementary nature of these two tasks and utilizes shared GPU resources to run them jointly, using a method called co-serving. To achieve this, FlexLLM introduces a novel token-level finetuning mechanism, which breaks down the finetuning computation of a sequence into smaller token-level computations and uses dependent parallelization and graph pruning, two static compilation optimizations, to minimize the memory overhead and latency for co-serving. Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36% while maintaining a low inference latency and improving finetuning throughput. For example, under a heavy inference workload, FlexLLM can still preserve more than 80% of the peak finetuning throughput, whereas existing systems cannot make any progress with finetuning. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow.

arxiv preprint arxiv, flexllm, tensor

2402.18789

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > California > San Diego County > Carlsbad (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence, Jun-6-2023

InFi: End-to-End Learning to Filter Input for Resource-Efficiency in Mobile-Centric Inference

Yuan, Mu, Zhang, Lan, He, Fengxiang, Tong, Xueting, Song, Miao-Hui, Xu, Zhengyuan, Li, Xiang-Yang

Mobile-centric AI applications have high requirements for resource-efficiency of model inference. Input filtering is a promising approach to eliminate redundancy and thereby reduce the cost of inference. Previous efforts have tailored effective solutions for many applications, but left two essential questions unanswered: (1) theoretical filterability of an inference workload to guide the application of input filtering techniques, thereby avoiding the trial-and-error cost for resource-constrained mobile applications; (2) robust discriminability of feature embedding to allow input filtering to be widely effective for diverse inference tasks and input content. To answer them, we first formalize the input filtering problem and theoretically compare the hypothesis complexity of inference models and input filters to understand the optimization potential. Then we propose the first end-to-end learnable input filtering framework that covers most state-of-the-art methods and surpasses them in feature embedding with robust discriminability. We design and implement InFi, which supports six input modalities and multiple mobile-centric deployments. Comprehensive evaluations confirm our theoretical results and show that InFi outperforms strong baselines in applicability, accuracy, and efficiency. InFi achieves 8.5x throughput and saves 95% of bandwidth, while keeping over 90% accuracy, for a video analytics application on mobile platforms.

data mining, machine learning, workload

2209.13873

Country:

Asia > China > Anhui Province > Hefei (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California (0.04)

Genre: Research Report > Promising Solution (0.54)

Industry: Information Technology (0.93)

Technology:

Information Technology > Internet of Things (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Mobile (1.00)

#artificialintelligence, Aug-3-2021, 17:11:53 GMT

Nvidia's Speedy New Inference Engine Keeps BERT Latency Within a Millisecond

Disappointment abounds when your data scientists dial in the accuracy of deep learning models to a high degree but are then forced to gut the model for inference because of resource constraints. Fortunately, that will not happen often with the latest release of Nvidia's TensorRT inference engine, which can run the BERT-Large transformer model with less than a millisecond of latency, according to the AI systems maker. "Traditionally, training for AI is always done in the data center," Siddharth Sharma, Nvidia's head of product marketing for AI Software, said in a July 19 (Monday) briefing. "You start with petabytes of data, hundreds of thousands of hours of speech data. You train the model to the highest accuracy that you can. And then once you trained it, you actually throw it over for inference."

accuracy, nvidia, sharma

Country:

North America > United States > California > San Diego County > San Diego (0.05)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.05)

Industry: Information Technology > Hardware (0.90)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

#artificialintelligence, Nov-15-2020, 13:05:16 GMT

Amazon begins shifting Alexa's cloud AI to its own silicon

On Thursday, an Amazon AWS blog post announced that the company has moved most of the cloud processing for its Alexa personal assistant off of Nvidia GPUs and onto its own Inferentia Application Specific Integrated Circuit (ASIC). AWS Inferentia is a custom chip, built by AWS, to accelerate machine learning inference workloads and optimize their cost. Each NeuronCore implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, dramatically reducing latency and increasing throughput. When an Amazon customer--usually someone who owns an Echo or Echo Dot--makes use of the Alexa personal assistant, very little of the processing is done on the device itself.

amazon begin, inferentia, own silicon

Industry: Information Technology (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.41)

#artificialintelligence, Dec-16-2019, 22:43:20 GMT

How to run machine learning at scale -- without going broke

Machine learning is computationally expensive -- and because serving real-time predictions means running your ML models in the cloud, that computational expense translates into real dollars. Put another way, if you wanted to add a translation feature to your app that automatically translated text to your user's preferred language, you would deploy an NLP model as a web API for your app to consume. To host this API, you would need to deploy it through a cloud provider like AWS, put it behind a load balancer, and implement some kind of autoscaling functionality (probably involving Docker and Kubernetes). None of the above is free, and if you're dealing with a large amount of traffic, the total cost can get out of hand. This is especially true if you aren't optimizing your spend.

inference, inference workload, infrastructure

Industry: Information Technology (0.37)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

#artificialintelligence, Nov-22-2019, 16:44:17 GMT

Growing Pains: Scaling Deep Learning Inference - IT Peer Network

Training an effective deep neural network is one thing, but deploying it in a way that keeps up with customer demand and is both performant and cost-efficient is hard. We've combined a heavily optimized software stack with deep learning-enabled hardware to fix that. There's an exciting change in the mix of problems that machine learning folks talk about. Teams have found their groove with data management and model training, and now have rapidly expanding user-bases. Of course, as great as it is to see your user graph go vertical, success comes with new problems.

inference, intel select solution, scaling deep learning inference

Industry: Health & Medicine (0.31)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

#artificialintelligence, Nov-8-2019, 20:51:43 GMT

Nvidia targets neural networks in the datacentre with new benchmark

Nvidia has announced a series of new benchmarks tracking the performance of tools for running AI inference both at the edge and in the datacentre. The results come from MLPerf Inference 0.5, the industry's first independent suite of AI benchmarks for inference, and help to demonstrate the performance of NVIDIA Turing GPUs for datacentres and the NVIDIA Xavier system-on-a-chip for edge computing. Nvidia posted the fastest results on new benchmarks measuring the performance of AI inference workloads in datacentres and at the edge -- building on the company's position in recent benchmarks measuring AI training. 'AI is at a tipping point as it moves swiftly from research to large-scale deployment for real applications,' said Ian Buck, general manager and vice president of Accelerated Computing at NVIDIA. 'AI inference is a tremendous computational challenge.

benchmark, new benchmark, nvidia

Industry: Information Technology > Hardware (1.00)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Cloud Computing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.52)