A Universal Load Balancing Principle and Its Application to Large Language Model Serving
Chen, Zixi, Bu, Tianci, Song, Chendong, Lu, Xin, Ye, Yinyu, Zhou, Zijie
Load balancing (the allocation of work across parallel resources to reduce delay, energy, and cost) is a pervasive challenge in science and engineering, from large-scale simulation and data processing to cloud and manufacturing operations. Motivated by the emerging bottleneck in large language model (LLM) serving, we study a particularly stringent regime of load balancing that arises in barrier-synchronized, stateful systems: work cannot be freely migrated and progress is gated by the slowest participant at each step, so heterogeneity and temporal drift in workloads create persistent stragglers and substantial idle time. LLM serving under data-parallel decoding provides a prominent modern instance: in production traces, barrier-induced idle time can exceed 40% of compute time per decode step. Here we develop a universal load-balancing principle, which admits a step-wise finite-horizon integer-optimization formulation and yields worst-case guarantees: across LLM decode models and a broader class of non-decreasing workload drift processes, it reduces long-run imbalance by a factor that grows with batch size and system scale. Extensive experiments corroborate the theory, showing substantial improvements in throughput and latency together with reductions in energy consumption. These results provide a general, theoretically grounded framework for load balancing, with immediate implications for sustainable LLM serving and broad relevance to other synchronization-gated resource-allocation problems.
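The step-wise assignment idea described in the abstract can be illustrated with a small sketch. The exact method in the paper is a finite-horizon integer program; the snippet below substitutes a simple greedy longest-processing-time heuristic (all function and variable names are illustrative, not from the paper) to show how per-step assignment of new requests to data-parallel replicas can bound the straggler's load.

```python
import heapq

def assign_requests(replica_loads, new_request_costs):
    """Greedily assign each new request to the currently least-loaded
    replica (an LPT-style heuristic standing in for the paper's exact
    integer program). Returns the assignment and the final loads."""
    heap = [(load, i) for i, load in enumerate(replica_loads)]
    heapq.heapify(heap)
    assignment = []
    # Placing large requests first tightens the greedy bound.
    for cost in sorted(new_request_costs, reverse=True):
        load, i = heapq.heappop(heap)
        assignment.append((cost, i))
        heapq.heappush(heap, (load + cost, i))
    loads = [0] * len(replica_loads)
    for load, i in heap:
        loads[i] = load
    return assignment, loads
```

Because every replica must cross the barrier together, the objective that matters is the maximum (not the sum) of per-replica loads, which is exactly what the least-loaded-first rule attacks.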
Acer Swift Edge 14 AI review: A proper mobility champ
The Acer Swift Edge 14 AI isn't without its faults, but it otherwise delivers a great all-around experience, with extra points for the gorgeous matte display. If you're more often on the move than not, it will make a great partner. Acer has renewed its Swift line with a compact new model, the Swift Edge 14 AI, which boasts not only the thinness the Swift line is known for but also an exceptionally low weight of just 2.18 pounds.
A Adaptive Measurements
(Definition 1). In Appendix D.4, we show that this marginal trick significantly improves performance.

A.3 MWEM update
In this section we derive the update rule in Algorithm 4, assuming γ = 0; recall that the ultimate goal is to solve the optimization problem given by the loss function L. We present the hyperparameters used for all methods across all experiments in Tables 1, 2, 3, 4, and 5. In Figures 5, 6, and 7, we present results for the same experiments described in Section 7.1 (Figures 1 and 2), adding plots for mean error and root mean squared error (RMSE); the x-axes use a logarithmic scale. We leave further investigation to future work.
Toward Efficient Inference for Mixture of Experts
Mixture-of-Experts (MoE) models have recently gained traction by achieving state-of-the-art performance on a wide range of tasks in computer vision and natural language processing. They effectively expand model capacity while incurring a minimal increase in computation cost during training. However, deploying such models for inference is difficult due to their large model size and complex communication pattern. In this work, we provide a characterization of two MoE workloads, namely Language Modeling (LM) and Machine Translation (MT), and identify their sources of inefficiency at deployment. We propose three optimization techniques to mitigate these inefficiencies, namely (1) dynamic gating, (2) expert buffering, and (3) expert load balancing. We show that dynamic gating improves maximum throughput by 6.21-11.55$\times$.
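The expert load-balancing idea can be sketched with capacity-aware routing: each token goes to its highest-scoring expert that still has free capacity, bounding the work any single expert receives. This is a generic illustration of the technique, not the paper's implementation; names and the drop-on-overflow policy are assumptions.

```python
def route_with_capacity(gate_scores, capacity):
    """Capacity-aware top-1 routing: each token is sent to its
    highest-scoring expert with free capacity, which balances load
    by capping the number of tokens any one expert processes."""
    num_experts = len(gate_scores[0])
    load = [0] * num_experts
    routes = []
    for scores in gate_scores:
        # Rank experts by this token's gate score, best first.
        for e in sorted(range(num_experts), key=lambda i: -scores[i]):
            if load[e] < capacity:
                load[e] += 1
                routes.append(e)
                break
        else:
            routes.append(None)  # all experts full: token is dropped
    return routes, load
```

In a real MoE layer the capacity factor trades off load balance against routing fidelity: a tight cap equalizes expert work but overflows more tokens to second-choice experts.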
SUBP: Soft Uniform Block Pruning for 1$\times$N Sparse CNNs Multithreading Acceleration
The study of sparsity in Convolutional Neural Networks (CNNs) has become widespread as a way to compress and accelerate models in resource-limited environments. By constraining N consecutive weights along the output channel to be group-wise non-zero, the recent 1$\times$N sparsity pattern has gained tremendous popularity for its three outstanding advantages: 1) a large amount of storage saved via a \emph{Block Sparse Row} matrix.
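The storage saving from 1$\times$N sparsity comes from keeping only the non-zero blocks plus their indices. A minimal pure-Python sketch of this block-sparse-row-style layout (simplified; real BSR formats also store row pointers, and the names here are illustrative):

```python
def to_block_rows(weight, n):
    """Store a dense 2-D weight matrix as a list of non-zero 1 x n
    blocks, each recorded as (row, block_column, values): a
    simplified Block Sparse Row layout for 1 x N sparsity."""
    blocks = []
    for r, row in enumerate(weight):
        for c in range(0, len(row), n):
            block = row[c:c + n]
            if any(block):          # keep only non-zero blocks
                blocks.append((r, c // n, list(block)))
    return blocks
```

Because whole 1$\times$N blocks are either kept or dropped, the index overhead is one entry per block rather than one per weight, and each kept block is contiguous in memory, which is what enables vectorized multithreaded kernels.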
Learning to Optimize Tensor Programs
We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high-dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries, such as cuDNN, in which only a narrow range of server-class GPUs are well supported. This reliance on hardware-specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search for tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU.
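The cost-model-guided search pattern can be sketched in a few lines: a cheap learned model shortlists promising program variants so that only a handful need to be measured on real hardware. This is a generic illustration of the pattern, not the framework's API; all names are assumptions.

```python
def search_best(variants, cost_model, measure, top_k=8):
    """Shortlist candidate program variants by a cheap (learned)
    cost model, then run the expensive hardware measurement only
    on the shortlist and return the fastest variant found."""
    shortlist = sorted(variants, key=cost_model)[:top_k]
    return min(shortlist, key=measure)
```

The win is in the ratio of costs: evaluating the statistical model takes microseconds, while compiling and timing a variant on the device takes seconds, so even a moderately accurate model prunes billions of variants down to a measurable handful.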
Efficient Algorithms for Device Placement of DNN Graph Operators
Modern machine learning workloads use large models, with complex structures, that are very expensive to execute. The devices that execute such complex models are becoming increasingly heterogeneous, as Domain-Specific Architectures (DSAs) flourish as hardware accelerators alongside CPUs.
Acemagic M1 review: A mini PC with big-time power
The Acemagic M1 packs Intel's i9-13900HK into a surprisingly compact housing, promising desktop performance in a minimum of space. It is a very fast mini PC that delivers classic desktop performance with near-silent operation. Although it lacks an NPU and some GPU power compared to the latest AI mini PCs, it impresses with a powerful processor, plenty of ports, and an attractive price. If you mainly run office, development, and moderate media workloads, you'll get a lot of computing power in a small form factor.
ELANA: A Simple Energy and Latency Analyzer for LLMs
Chiang, Hung-Yueh, Wang, Bokun, Marculescu, Diana
The latency and power consumption of large language models (LLMs) are major constraints when serving them across a wide spectrum of hardware platforms, from mobile edge devices to cloud GPU clusters. Benchmarking is crucial for optimizing efficiency in both model deployment and next-generation model development. To address this need, we open-source a simple profiling tool, \textbf{ELANA}, for evaluating LLMs. ELANA is designed as a lightweight, academic-friendly profiler for analyzing model size, key-value (KV) cache size, prefilling latency (Time-to-first-token, TTFT), generation latency (Time-per-output-token, TPOT), and end-to-end latency (Time-to-last-token, TTLT) of LLMs on both multi-GPU and edge GPU platforms. It supports all publicly available models on Hugging Face and offers a simple command-line interface, along with optional energy consumption logging. Moreover, ELANA is fully compatible with popular Hugging Face APIs and can be easily customized or adapted to compressed or low bit-width models, making it ideal for research on efficient LLMs or for small-scale proof-of-concept studies. We release the ELANA profiling tool at: https://github.com/enyac-group/Elana.
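The latency metrics ELANA reports (TTFT, TPOT, TTLT) can be computed from per-token timestamps. The sketch below shows the arithmetic only; it is not ELANA's code, and the function name and token-generator interface are assumptions.

```python
import time

def profile_generation(token_stream):
    """Compute TTFT (time to first token), mean TPOT (time per
    output token after the first), and TTLT (time to last token)
    from any iterable that yields generated tokens."""
    start = time.perf_counter()
    # Record a timestamp immediately after each token is produced.
    stamps = [time.perf_counter() for _ in token_stream]
    if not stamps:
        return None
    ttft = stamps[0] - start
    ttlt = stamps[-1] - start
    tpot = (ttlt - ttft) / max(len(stamps) - 1, 1)
    return {"ttft": ttft, "tpot": tpot, "ttlt": ttlt}
```

TTFT isolates the prefill phase while TPOT isolates steady-state decoding, which is why the two are reported separately: prefill is compute-bound on the prompt, whereas decoding is typically memory-bandwidth-bound per token.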