CARMA: Collocation-Aware Resource Manager
Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Bulat Ibragimov, Florina M. Ciorba, Pınar Tözün
arXiv.org Artificial Intelligence
GPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization but introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource management system that operates at server scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs and a collocation risk analysis that filters out the high-risk GPUs; (2) task placement policies that cap GPU utilization to avoid OOMs and limit interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method that relaunches jobs that crashed due to OOMs. Our evaluation on a DL training workload derived from real-world traces shows that CARMA uses GPUs more efficiently by making more informed collocation decisions: for the best-performing collocation policy, CARMA increases GPU streaming multiprocessor (SM) utilization by 54%, the parallelism achieved per SM by 61%, and memory use by 62%. For this workload, this results in reductions of approximately 35% in end-to-end execution time (makespan) and approximately 15% in GPU energy consumption.
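The filtering and capping steps described in the abstract can be illustrated with a minimal sketch. This is not CARMA's implementation: the `GpuState` record, the `pick_gpu` function, the cap values (`sm_util_cap`, `mem_cap`), and the "least-utilized GPU wins" tie-break are all hypothetical choices made here for illustration, and the estimated memory need is assumed to come from some external estimator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuState:
    """Hypothetical snapshot of one GPU as a monitor might record it."""
    gpu_id: int
    mem_total_mib: int
    mem_used_mib: int
    sm_util: float  # fraction of time SMs were busy, in [0, 1]


def pick_gpu(
    gpus: List[GpuState],
    est_mem_mib: int,
    sm_util_cap: float = 0.8,  # assumed cap, not a value from the paper
    mem_cap: float = 0.9,      # assumed cap, not a value from the paper
) -> Optional[GpuState]:
    """Return a GPU that can host the task under both caps, or None."""
    candidates = []
    for gpu in gpus:
        # Filter GPUs that are already busy enough to risk interference.
        if gpu.sm_util > sm_util_cap:
            continue
        # Filter GPUs where the estimated memory need would risk an OOM.
        projected_mib = gpu.mem_used_mib + est_mem_mib
        if projected_mib > mem_cap * gpu.mem_total_mib:
            continue
        candidates.append(gpu)
    if not candidates:
        return None  # caller would keep the task queued and retry later
    # Assumed policy: prefer the least-utilized GPU, break ties by id.
    return min(candidates, key=lambda g: (g.sm_util, g.gpu_id))


if __name__ == "__main__":
    fleet = [
        GpuState(0, 40000, 30000, 0.5),
        GpuState(1, 40000, 10000, 0.9),
        GpuState(2, 40000, 5000, 0.6),
    ]
    chosen = pick_gpu(fleet, est_mem_mib=8000)
    print(chosen.gpu_id if chosen else "queued")
```

In this sketch a task that finds no eligible GPU is simply left for the caller to queue; a recovery path that relaunches a task after an OOM crash would sit in the same caller and is not shown.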
Nov-4-2025
- Country:
  - Europe
    - Denmark > Capital Region > Copenhagen (0.04)
    - Switzerland > Basel-City > Basel (0.04)
  - North America
    - Canada > Ontario > Toronto (0.14)
    - United States
      - California
        - San Diego County > Carlsbad (0.04)
        - Santa Clara County > Santa Clara (0.04)
      - Massachusetts > Suffolk County > Boston (0.04)
- Genre:
  - Research Report > New Finding (0.67)
- Industry:
  - Energy (0.48)
- Technology: