A Scalable NorthPole System with End-to-End Vertical Integration for Low-Latency and Energy-Efficient LLM Inference

DeBole, Michael V., Appuswamy, Rathinakumar, McGlohon, Neil, Taba, Brian, Esser, Steven K., Akopyan, Filipp, Arthur, John V., Amir, Arnon, Andreopoulos, Alexander, Carlson, Peter J., Cassidy, Andrew S., Datta, Pallab, Flickner, Myron D., Gandhasri, Rajamohan, Garreau, Guillaume J., Ito, Megumi, Klamo, Jennifer L., Kusnitz, Jeffrey A., McClatchey, Nathaniel J., McKinstry, Jeffrey L., Nayak, Tapan K., Otero, Carlos Ortega, Penner, Hartmut, Risk, William P., Sawada, Jun, Sivagnaname, Jay, Smith, Daniel F., Sousa, Rafael, Terrizzano, Ignacio, Ueda, Takanori, Gray-Donald, Trent, Cox, David, Modha, Dharmendra S.


Abstract--A vertically integrated, end-to-end research prototype system combines 288 NorthPole neural inference accelerator cards, offline training algorithms, a high-performance runtime stack, and a containerized inference pipeline to deliver a scalable and efficient cloud inference service. The system delivers 115 peta-ops at 4-bit integer precision and 3.7 PB/s of memory bandwidth across 18 2U servers, while consuming only 30 kW of power and weighing 730 kg in a 0.67 m² footprint. The system can run 3 simultaneous instances of the 8-billion-parameter open-source IBM Granite-3.3-8b-instruct model. It is scalable, modular, and reconfigurable, supporting various model sizes and context lengths, and is well suited to deploying agentic workflows for enterprise AI applications in existing data center (cloud, on-prem) environments. For example, the system can support 18 instances of a 3-billion-parameter model or a single instance of a 70-billion-parameter model.

Large language models have become a pervasive form of computing, and while the current paradigm has been to push frontier models for all applications, it is becoming evident that "Faith in God-like large language models is waning" [1]. Indeed, on the current trajectory, global energy requirements for AI-focused data centers are projected to reach double-digit percentages of total electricity consumption by 2030, with individual facilities requiring 1 gigawatt or more of dedicated power, driving both infrastructure and cooling costs toward potentially unsustainable or unprofitable levels [2], [3]. However, for many business applications, frontier models containing trillions of parameters may prove less useful and less cost-efficient than much smaller language models with only a tenth or even a hundredth as many parameters [4].
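To make the abstract's headline figures concrete, the short Python sketch below derives a few per-server and per-instance numbers from the specs as stated. All inputs come from the text; the even split of cards across concurrent model instances is our assumption for illustration, not a documented allocation, and the outputs are arithmetic consequences of the quoted specs rather than measured per-card figures.

    # Back-of-the-envelope figures derived from the abstract's stated specs.
    # All inputs are quoted from the text; the even per-instance card split
    # below is an illustrative assumption, not a documented mapping.

    CARDS = 288            # NorthPole accelerator cards in the system
    SERVERS = 18           # 2U servers
    PETA_OPS = 115         # peta-ops at 4-bit integer precision
    BANDWIDTH_PBS = 3.7    # aggregate memory bandwidth, PB/s
    POWER_KW = 30          # total system power, kW

    print(f"cards per server:   {CARDS / SERVERS:.0f}")                      # 16
    print(f"compute efficiency: {PETA_OPS / POWER_KW:.2f} peta-ops/kW")      # 3.83
    print(f"bandwidth per card: {BANDWIDTH_PBS * 1000 / CARDS:.1f} TB/s")    # 12.8

    # Assumed even split of the 288 cards across concurrent model instances:
    for model, instances in [("3B", 18), ("8B", 3), ("70B", 1)]:
        print(f"{model} model x{instances:2d} instances -> "
              f"{CARDS // instances} cards per instance")

Under this even-split assumption, the three configurations quoted in the abstract correspond to 16, 96, and 288 cards per model instance, respectively, which is consistent with the claim that the same hardware can be reconfigured across model sizes.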