Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation

Zarkadas, Ioannis, Tomlinson, Amanda, Cidon, Asaf, Kasikci, Baris, Weisse, Ofir

arXiv.org Artificial Intelligence

As models become larger, ML accelerators are a scarce resource whose performance must be continually optimized to improve efficiency. Existing performance analysis tools are coarse grained, and fail to capture model performance at the machine-code level. In addition, these tools often do not provide specific recommendations for optimizations. We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level that provides actionable optimization recommendations.

These portable mid-level representations are then compiled into the byte-code which runs on the ML accelerator. The development of each of these levels of abstraction requires a huge engineering effort, and inefficiencies introduced at any level can cause performance degradation for the model. The companies that offer generative AI services are often doing so at a massive scale (for example, the infrastructure to provide inference for Microsoft's Bing AI chatbot is estimated to cost $4 billion [57]), meaning that even a small degradation in performance can lead to large capital losses.
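
As a rough illustration of what machine-code-level analysis can surface, the sketch below aggregates stall cycles per opcode from a simulated instruction trace. The trace format and all numbers are our own assumptions for illustration, not xPU-Shark's actual interface.

```python
# Hypothetical sketch: attribute stall cycles to opcodes from a simulated
# instruction trace. The (opcode, busy, stall) format is an assumption.
from collections import Counter

trace = [  # made-up cycle counts for illustration
    ("matmul", 120, 0),
    ("load",     4, 96),
    ("load",     4, 88),
    ("store",    4, 40),
    ("add",      2, 0),
]

stall_by_op = Counter()
for opcode, _busy, stall in trace:
    stall_by_op[opcode] += stall

total = sum(s for _, _, s in trace) or 1
for op, cycles in stall_by_op.most_common():
    # the heaviest stallers are the candidates for optimization
    print(f"{op:>7}: {cycles:4d} stall cycles ({100 * cycles / total:.0f}%)")
```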


Approach to Visual Attractiveness of Event Space Through Data-Driven Environment and Spatial Perception

Majiid, Aliffi, Mian, Riaz-Ul-Haque, Kurohara, Kouki, Nguyen-Tran, Yen-Khang

arXiv.org Artificial Intelligence

Revitalizing Japan's remote areas has become a crucial task, and Matsue City exemplifies this effort in its temporary event spaces, created through collective efforts to foster urban vibrancy and bring together residents and visitors. This research examines the relationship between data-driven insights using generative AI and visual attractiveness by evaluating temporary events in Matsue City, particularly considering the cognitive-cultural differences in how participants process visual information. The first phase employs semantic keyword extraction from interviews, categorizing responses into physical elements, activities, and atmosphere. The second phase analyzes spatial perception through three categories: layout hierarchy, product visibility, and visual attention. The correlation analysis indicates that successful event design requires a balance between spatial efficiency and diverse needs, with a spatial organization that optimizes visitor flow and visibility strategies that account for cultural and demographic diversity. These findings contribute to understanding the urban quality of temporary event spaces and offer a replicable framework for enhancing the visual appeal of events in remote areas throughout Japan.


Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs

Jain, Rishabh, Bhasi, Vivek M., Jog, Adwait, Sivasubramaniam, Anand, Kandemir, Mahmut T., Das, Chita R.

arXiv.org Artificial Intelligence

Personalized recommendation is a ubiquitous application on the internet, with many industries and hyperscalers extensively leveraging Deep Learning Recommendation Models (DLRMs) for their personalization needs (like ad serving or movie suggestions). With growing model and dataset sizes pushing computation and memory requirements, GPUs are increasingly preferred for executing DLRM inference. However, serving newer DLRMs while meeting acceptable latencies remains challenging, making traditional deployments increasingly GPU-hungry and driving up inference serving costs. In this paper, we show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, causing up to a 3.2x slowdown in the embedding stage alone. To thoroughly grasp the problem, we conduct a detailed microarchitecture characterization and highlight the low occupancy of the standard embedding kernels. By leveraging direct compiler optimizations, we achieve optimal occupancy, improving performance by up to 53%. Yet, long memory-latency stalls persist. To tackle this challenge, we propose specialized plug-and-play software prefetching and L2 pinning techniques, which help hide and reduce these latencies. Further, we propose combining them, as they complement each other. Experimental evaluations using A100 GPUs with large models and datasets show that our proposed techniques improve performance by up to 103% for the embedding stage, and up to 77% for the overall DLRM inference pipeline.
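
To make the latency-hiding idea concrete, here is a minimal PyTorch sketch that overlaps the next batch's embedding gather with the current batch's MLP compute on a second CUDA stream. This is a batch-level analogue under our own assumptions, not the paper's kernel-level prefetching or L2 pinning implementation.

```python
# Batch-level sketch (NOT the paper's kernel-level technique): prefetch the
# next batch's embedding rows on a side stream while the MLP runs.
import torch

assert torch.cuda.is_available()
dev = "cuda"
table = torch.randn(1_000_000, 64, device=dev)        # toy embedding table
mlp = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 1)).to(dev)
batches = [torch.randint(0, table.shape[0], (4096,), device=dev)
           for _ in range(8)]

copy_stream = torch.cuda.Stream()
copy_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(copy_stream):                  # prefetch batch 0
    nxt = table[batches[0]]

for i in range(len(batches)):
    torch.cuda.current_stream().wait_stream(copy_stream)
    cur = nxt
    if i + 1 < len(batches):
        with torch.cuda.stream(copy_stream):          # gather batch i+1 ...
            nxt = table[batches[i + 1]]
    out = mlp(cur)                                    # ... while computing batch i
torch.cuda.synchronize()
```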


Revisiting SLO and Goodput Metrics in LLM Serving

Wang, Zhibin, Li, Shipeng, Zhou, Yuhang, Li, Xue, Gu, Rong, Cam-Tu, Nguyen, Tian, Chen, Zhong, Sheng

arXiv.org Artificial Intelligence

Large language models (LLMs) have achieved remarkable performance and are widely deployed in various applications, but serving LLM inference raises concerns about user experience and serving throughput. Accordingly, service level objectives (SLOs) and goodput, the number of requests that meet SLOs per second, have been introduced to evaluate the performance of LLM serving. However, existing metrics fail to capture the nature of user experience. We observe two counterintuitive phenomena in existing metrics: 1) delaying token delivery can smooth the tail time between tokens (tail TBT) of a request, and 2) dropping a request that fails to meet its SLOs midway can improve goodput. In this paper, we revisit SLO and goodput metrics in LLM serving and propose a unified metric framework, smooth goodput, which incorporates SLOs and goodput to reflect the nature of user experience in LLM serving. The framework can adapt to the specific goals of different tasks by setting parameters. We re-evaluate the performance of different LLM serving systems under multiple workloads based on this unified framework and provide possible directions for future optimization of existing strategies. We hope that this framework can provide a unified standard for evaluating LLM serving and foster research in the field of LLM serving optimization to move in a cohesive direction.
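
The first phenomenon is easy to reproduce with toy numbers: evenly pacing token delivery strictly delays every intermediate token, yet it reduces tail TBT. The arrival times below are illustrative, not from the paper.

```python
# Toy demonstration: delaying (pacing) token delivery lowers tail TBT even
# though every token except the last arrives later than before.
import numpy as np

raw = np.array([0.00, 0.02, 0.04, 0.06, 0.50])   # bursty arrivals, one stall (s)
paced = np.linspace(0.00, 0.50, 5)                # same last token, even pacing

for name, t in [("raw", raw), ("paced", paced)]:
    tbt = np.diff(t)                              # time between tokens
    print(f"{name:>5}: tail TBT = {tbt.max():.3f}s, mean TBT = {tbt.mean():.3f}s")
# raw  : tail TBT = 0.440s -- reflects the real stall
# paced: tail TBT = 0.125s -- "better" by the metric, worse for the user
```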


Metron: Holistic Performance Evaluation Framework for LLM Inference Systems

Agrawal, Amey, Agarwal, Anmol, Kedia, Nitin, Mohan, Jayashree, Kundu, Souvik, Kwatra, Nipun, Ramjee, Ramachandran, Tumanov, Alexey

arXiv.org Artificial Intelligence

Serving large language models (LLMs) in production can incur substantial costs, which has prompted recent advances in inference system optimizations. Today, these systems are evaluated against conventional latency and throughput metrics (e.g., TTFT, TBT, Normalised Latency, and TPOT). However, these metrics fail to fully capture the nuances of LLM inference, leading to an incomplete assessment of the user-facing performance crucial for real-time applications such as chat and translation. In this paper, we first identify the pitfalls of current performance metrics in evaluating LLM inference systems. We then propose Metron, a comprehensive performance evaluation framework that includes fluidity-index, a novel metric designed to reflect the intricacies of the LLM inference process and its impact on real-time user experience. Finally, we evaluate various existing open-source platforms and model-as-a-service offerings using Metron, discussing their strengths and weaknesses. Metron is available at https://github.com/project-metron/metron.
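
As a hedged sketch of what a deadline-style token metric can look like (the precise definition of fluidity-index is given in the paper), the snippet below scores a request by the fraction of tokens arriving before per-token deadlines derived from TTFT and TBT targets. The SLO values and arrival times are illustrative.

```python
# Simplified deadline-based metric in the spirit of fluidity-index: token i
# must arrive by ttft_slo + i * tbt_slo; report the fraction of deadlines met.
def fluidity(arrivals, ttft_slo=0.5, tbt_slo=0.05):
    met = sum(t <= ttft_slo + i * tbt_slo for i, t in enumerate(arrivals))
    return met / len(arrivals)

arrivals = [0.4, 0.44, 0.48, 0.9, 0.95]   # seconds since request submission
print(f"fraction of token deadlines met: {fluidity(arrivals):.2f}")  # 0.60
```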


Analytics of Longitudinal System Monitoring Data for Performance Prediction

Costello, Ian J., Bhatele, Abhinav

arXiv.org Artificial Intelligence

In recent years, several HPC facilities have started continuously monitoring their systems and jobs to collect performance-related data for understanding performance and operational efficiency. Such data can be used to optimize the performance of individual jobs and the overall system by creating data-driven models that can predict the performance of jobs waiting in the scheduler queue. In this paper, we model the performance of representative control jobs using longitudinal system-wide monitoring data and machine learning to explore the causes of performance variability. We analyze these prediction models in detail to identify the features that are the dominant predictors of performance. We demonstrate that such models can be application-agnostic and can be used to predict the performance of applications not included in the training data.
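
A minimal sketch of this modeling approach, with entirely synthetic telemetry and hypothetical feature names, might fit a regression model on system-wide features and read off the dominant predictors:

```python
# Hedged sketch: predict job runtime from (synthetic) system telemetry and
# inspect feature importances. Feature names and data are made up.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = ["network_bytes", "filesystem_ops", "cpu_temp", "neighbor_jobs"]
X = rng.random((500, len(features)))
y = 3.0 * X[:, 0] + 1.5 * X[:, 3] + 0.1 * rng.random(500)  # synthetic runtimes

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, imp in sorted(zip(features, model.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name:>15}: {imp:.2f}")   # dominant predictors of performance
```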


Structured Reinforcement Learning for Media Streaming at the Wireless Edge

Bura, Archana, Bobbili, Sarat Chandra, Rameshkumar, Shreyas, Rengarajan, Desik, Kalathil, Dileep, Shakkottai, Srinivas

arXiv.org Artificial Intelligence

Media streaming is the dominant application over wireless edge (access) networks. The increasing softwarization of such networks has led to efforts at intelligent control, wherein application-specific actions may be dynamically taken to enhance the user experience. The goal of this work is to develop and demonstrate learning-based policies for optimal decision making to determine which clients to dynamically prioritize in a video streaming setting. We formulate the policy design question as a constrained Markov decision process (CMDP), and observe that by using a Lagrangian relaxation we can decompose it into single-client problems. Further, the optimal policy takes a threshold form in the video buffer length, which enables us to design an efficient constrained reinforcement learning (CRL) algorithm to learn it. Specifically, we show that a natural policy gradient (NPG) based algorithm, derived using the structure of our problem, converges to the globally optimal policy. We then develop a simulation environment for training, and a real-world intelligent controller attached to a WiFi access point for evaluation. We empirically show that the structured learning approach enables fast learning. Furthermore, such a structured policy can be easily deployed due to its low computational complexity, with policy execution taking only about 15 µs. Using YouTube streaming experiments in a resource-constrained scenario, we demonstrate that the CRL approach can increase quality of experience (QoE) by over 30%.
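
The threshold structure is simple to illustrate. In the toy simulation below (our own dynamics and numbers, not the paper's NPG algorithm), a client is prioritized exactly when its playback buffer falls below a threshold, trading a bounded amount of wireless resource for fewer stalls:

```python
# Toy sketch of the exploited structure: the per-client policy is a threshold
# in buffer length. The paper learns the threshold with a constrained
# natural-policy-gradient method, which is not reproduced here.
import random

random.seed(0)

def episode(threshold, steps=200):
    buf, uptime, cost = 5, 0, 0
    for _ in range(steps):
        a = 1 if buf < threshold else 0              # threshold policy
        arrivals = 2 if a else 1                     # prioritized gets more chunks
        drain = 1 if random.random() < 0.9 else 2    # playback, occasional burst
        buf = max(buf + arrivals - drain, 0)
        uptime += buf > 0                            # QoE proxy: no stall
        cost += a                                    # wireless resource use
    return uptime / steps, cost

for name, th in [("never prioritize", 0), ("threshold policy", 8)]:
    uptime, cost = episode(th)
    print(f"{name:>16}: uptime {uptime:.0%}, prioritized slots {cost}")
```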


Chatterbox: Robust Transport for LLM Token Streaming under Unstable Network

Li, Hanchen, Liu, Yuhan, Cheng, Yihua, Ray, Siddhant, Du, Kuntai, Jiang, Junchen

arXiv.org Artificial Intelligence

To render each generated token in real time, the LLM server generates response tokens one by one and streams each token (or group of a few tokens) through the network to the user right after it is generated, which we refer to as LLM token streaming. However, under unstable network conditions, the token streaming experience can suffer greatly from stalls, since a single packet loss can block the rendering of tokens contained in subsequent packets even if they arrive on time. With a real-world measurement study, we show that current applications, including ChatGPT, Claude, and Bard, all suffer from increased stalls under unstable networks. For this emerging token streaming problem in LLM chatbots, we propose a novel transport layer scheme, called Chatterbox, which puts newly generated tokens as well as currently unacknowledged tokens in the next outgoing packet. This ensures that each packet contains some new tokens and can be independently rendered when received, thus avoiding the aforementioned stalls caused by missing packets. Through simulation under various network conditions, we show that Chatterbox reduces the stall ratio (the proportion of time spent waiting for token rendering) by 71.0% compared to the token streaming method commonly used by real chatbot applications, and by 31.6% compared to a custom packet duplication scheme. By tailoring Chatterbox to the token-by-token generation of LLMs, we enable chatbots to respond as fluently as an eloquent speaker, letting users better enjoy pervasive AI.
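
The core mechanism is simple enough to sketch. Below is a minimal sender under our own naming (not the Chatterbox codebase) that places newly generated tokens plus all unacknowledged tokens into every outgoing packet, so that any single received packet renders up to the newest token:

```python
# Minimal sketch of the scheme described in the abstract: each outgoing packet
# carries the new tokens plus everything not yet acknowledged, so a lost
# packet never blocks rendering of later packets that do arrive.
from dataclasses import dataclass, field

@dataclass
class ChatterboxSender:
    unacked: list = field(default_factory=list)    # [(seq, token), ...]
    next_seq: int = 0

    def make_packet(self, new_tokens):
        for tok in new_tokens:
            self.unacked.append((self.next_seq, tok))
            self.next_seq += 1
        return list(self.unacked)                  # new + unacked, every time

    def on_ack(self, acked_through):               # cumulative ACK from client
        self.unacked = [(s, t) for s, t in self.unacked if s > acked_through]

s = ChatterboxSender()
p1 = s.make_packet(["The"])                        # suppose p1 is lost
p2 = s.make_packet([" quick", " fox"])
assert [t for _, t in p2] == ["The", " quick", " fox"]  # p2 alone renders all
s.on_ack(acked_through=2)                          # client has seq 0..2
assert s.unacked == []
```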


GITA guidance at AI stall for G20 delegates

#artificialintelligence

A modern take on the Gita's guidance, one that uses artificial intelligence, can help in finding solutions to life's problems. The AI stall, set up as part of the exhibition at the first digital economy working group meeting of G20 nations in Lucknow, gives a glimpse into this. GITA is an acronym for guidance, inspiration, transformation, and action. "The software has included all verses from the Bhagavad Gita that are used when someone asks a question. The answers are given using AI to help find solutions to problems in life," said Akash Goel of Tagbin, who installed the AI technology.


Analysis of Distributed Deep Learning in the Cloud

Sharma, Aakash, Bhasi, Vivek M., Singh, Sonali, Jain, Rishabh, Gunasekaran, Jashwant Raj, Mitra, Subrata, Kandemir, Mahmut Taylan, Kesidis, George, Das, Chita R.

arXiv.org Artificial Intelligence

We aim to resolve this problem by introducing a comprehensive distributed deep learning (DDL) profiler, which can determine the various execution "stalls" that DDL suffers from while running on a public cloud. We have implemented the profiler by extending prior work to additionally estimate two types of communication stalls: interconnect and network stalls. We train popular DNN models using the profiler to characterize various AWS GPU instances and list their advantages and shortcomings, helping users make informed decisions. We observe that the more expensive GPU instances may not be the most performant for all DNN models, and that AWS may sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads of up to 90% of DNN training time, and network-connected instances can suffer from up to a 5x slowdown compared to training on a single instance. Further, we model the impact of macroscopic DNN features, such as the number of layers and the number of gradients, on communication stalls. Finally, we propose a measurement-based recommendation model that helps users lower their public cloud monetary costs for DDL, given a time budget.
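
The final recommendation step can be sketched as a simple cost-versus-time selection. The instance names below are real AWS types, but the hours and prices are placeholders, not measured values from the paper:

```python
# Hedged sketch of a measurement-based recommendation: given profiled training
# times and hourly prices (all numbers hypothetical), pick the cheapest
# configuration whose estimated time fits the user's budget.
profiles = {  # instance: (est. hours for the job, $/hour) -- illustrative only
    "p3.2xlarge":    (20.0, 3.06),
    "p3.8xlarge":    (6.5, 12.24),
    "g4dn.12xlarge": (11.0, 3.91),
}

def recommend(budget_hours):
    ok = {k: h * p for k, (h, p) in profiles.items() if h <= budget_hours}
    return min(ok, key=ok.get) if ok else None     # cheapest feasible option

print(recommend(budget_hours=12))   # -> g4dn.12xlarge under these numbers
```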