Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference
Yinghan Li, Yifei Li, Jiejing Zhang, Bujiao Chen, Xiaotong Chen, Lian Duan, Yejun Jin, Zheng Li, Xuanyu Liu, Haoyu Wang, Wente Wang, Yajie Wang, Jiacheng Yang, Peiyang Zhang, Laiwen Zheng, Wenyuan Yu
arXiv.org Artificial Intelligence
Resource utilization is one of the key factors in fully exploiting the computing power of massively parallel devices such as GPUs. As a common method of improving utilization and reducing overhead, the benefit of batching should not be underestimated [7, 8, 11]. In most cases, it is straightforward to batch regular workloads that share the same type and size and therefore perform similar amounts of computation and memory access. In the CUDA programming model, for example, such regular workloads can be conveniently batched along an additional thread block or grid dimension [15]. Irregular workloads, however, do not fit naturally into this scheme. They may exhibit one or more characteristics that prevent regular batching [1]: variable amounts of computation, special memory access patterns, control flow divergence, and so on. Heterogeneous workloads make batching harder still; by heterogeneous, we mean workloads consisting of different types of operations, e.g., some workloads are reductions while others are element-wise operations. Instead of being batched, irregular workloads are often managed in a task-parallel fashion, where each individual workload is treated as a task and all tasks are dynamically scheduled [1, 19].
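The regular-batching scheme described above can be sketched as a minimal CUDA kernel (an illustration under stated assumptions, not code from the paper; the kernel name and data layout are hypothetical). Here, B identical vector additions of length n are batched along the grid's y dimension, so each y-slot of the grid processes one workload of the batch:

```cuda
// Sketch: batch B regular, same-sized vector additions along grid.y.
// Arrays a, b, c each hold B contiguous segments of n floats.
__global__ void batched_vec_add(const float* a, const float* b,
                                float* c, int n) {
    int batch = blockIdx.y;                        // which workload in the batch
    int i = blockIdx.x * blockDim.x + threadIdx.x; // element within the workload
    if (i < n) {
        int off = batch * n;                       // offset of this workload's segment
        c[off + i] = a[off + i] + b[off + i];
    }
}

// Launch: one grid.y slot per workload, enough grid.x blocks to cover n elements.
// dim3 grid((n + 255) / 256, B);
// batched_vec_add<<<grid, 256>>>(d_a, d_b, d_c, n);
```

This works precisely because every workload has the same size and operation; an irregular batch (variable n per workload, or a mix of reductions and element-wise ops) would break the uniform grid mapping, which is the difficulty the abstract highlights.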
Jan-27-2025