Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and NVIDIA Data Center GPUs

Sada, Mohammad Firas, Graham, John J., Khoda, Elham E., Tatineni, Mahidhar, Mishin, Dmitry, Gupta, Rajesh K., Wagner, Rick, Smarr, Larry, DeFanti, Thomas A., Würthwein, Frank


The rapid proliferation of large language models (LLMs) has fundamentally transformed scientific computing, enabling breakthroughs across domains from computational biology to materials science. As these models scale to hundreds of billions of parameters, high-performance computing (HPC) facilities face mounting challenges in providing sustainable, cost-effective inference capabilities to diverse research communities. Traditional GPU-centric approaches, while delivering exceptional throughput, impose significant barriers in power consumption, cooling requirements, and capital investment, barriers that are particularly problematic for shared research cyberinfrastructures serving hundreds of concurrent users.

The National Research Platform (NRP) exemplifies both these challenges and the opportunities they present. As a federated Kubernetes-based infrastructure supporting more than 300 research groups across more than 100 sites, the NRP must balance competing demands: delivering high-performance AI capabilities while managing constrained power budgets, enabling fine-grained resource allocation for multi-tenant workloads, and providing cost-effective access to emerging AI models for diverse scientific applications [1, 2].
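To make the fine-grained, multi-tenant allocation concrete, the sketch below shows how accelerators are typically requested on a Kubernetes cluster such as the NRP: device plugins expose accelerators as named, countable resources, and tenant pods request them alongside CPU and memory. This example is not from the paper; the image, namespace, and pod names are hypothetical, and while nvidia.com/gpu is the standard NVIDIA device-plugin resource name, the corresponding Qualcomm Cloud AI resource name shown here is an assumption that may differ by deployment.

    # Minimal sketch (assumptions noted above) of requesting accelerator
    # units from a multi-tenant Kubernetes cluster via the device-plugin
    # resource model, using the official kubernetes Python client.
    from kubernetes import client, config

    def make_inference_pod(name: str, image: str,
                           accelerator: str, count: int = 1) -> client.V1Pod:
        """Build a Pod spec requesting `count` units of one accelerator type."""
        container = client.V1Container(
            name="llm-server",
            image=image,
            resources=client.V1ResourceRequirements(
                # Device plugins schedule whole accelerator units per pod.
                limits={accelerator: str(count)},
                requests={"cpu": "4", "memory": "32Gi"},
            ),
        )
        return client.V1Pod(
            metadata=client.V1ObjectMeta(name=name,
                                         labels={"app": "llm-inference"}),
            spec=client.V1PodSpec(containers=[container],
                                  restart_policy="Never"),
        )

    if __name__ == "__main__":
        config.load_kube_config()  # assumes kubeconfig access to the cluster
        pod = make_inference_pod(
            name="qaic-llm-demo",
            image="example.org/llm-server:latest",  # hypothetical image
            accelerator="qualcomm.com/qaic",        # assumed name; NVIDIA GPUs
        )                                           # use "nvidia.com/gpu"
        client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

Because both accelerator families appear to the scheduler as interchangeable named resources, the same tenant-facing workflow can target either GPU or Cloud AI 100 nodes, which is what makes a side-by-side comparison of the two backends practical on shared infrastructure.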