A Study of Skews, Imbalances, and Pathological Conditions in LLM Inference Deployment on GPU Clusters detectable from DPU

Sep-24-2025–arXiv.org Artificial Intelligence

Autoregressive inference in large transformer-based language models (LLMs) presents significant challenges for runtime efficiency, particularly during the decode phase where load imbalance across GPU shards can cause throughput degradation and latency spikes. A DPU-assisted framework leveraged by BlueField-3 Data Processing Units can enable real-time detection and mitigation of load imbalance in multi-node tensor-parallel inference. By offloading monitoring tasks to the DPU and analyzing GPU telemetry and inter-node communication patterns, the resulting system can provide actionable feedback to inference controllers and schedulers. The goal of this study is three-fold i) identify the reported skews/imbalances/pathological conditions that arise in muti-GPU execution of a) LLM tensor computing (both during training and inference), b) identify their impact on computational performance, and c) make a critical assessment if those can be tracked for potential mitigation from a DPU network.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

Sep-24-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.41)

Industry:
- Information Technology (0.50)

Technology:
- Information Technology
  - Hardware (1.00)
  - Artificial Intelligence
    - Natural Language > Large Language Model (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found