
A Scalable NorthPole System with End-to-End Vertical Integration for Low-Latency and Energy-Efficient LLM Inference

DeBole, Michael V., Appuswamy, Rathinakumar, McGlohon, Neil, Taba, Brian, Esser, Steven K., Akopyan, Filipp, Arthur, John V., Amir, Arnon, Andreopoulos, Alexander, Carlson, Peter J., Cassidy, Andrew S., Datta, Pallab, Flickner, Myron D., Gandhasri, Rajamohan, Garreau, Guillaume J., Ito, Megumi, Klamo, Jennifer L., Kusnitz, Jeffrey A., McClatchey, Nathaniel J., McKinstry, Jeffrey L., Nayak, Tapan K., Otero, Carlos Ortega, Penner, Hartmut, Risk, William P., Sawada, Jun, Sivagnaname, Jay, Smith, Daniel F., Sousa, Rafael, Terrizzano, Ignacio, Ueda, Takanori, Gray-Donald, Trent, Cox, David, Modha, Dharmendra S.

arXiv.org Artificial Intelligence

Abstract--A vertically integrated, end-to-end research prototype system combines 288 NorthPole neural inference accelerator cards, offline training algorithms, a high-performance runtime stack, and a containerized inference pipeline to deliver a scalable and efficient cloud inference service. The system delivers 115 peta-ops at 4-bit integer precision and 3.7 PB/s of memory bandwidth across 18 2U servers, while consuming only 30 kW of power and weighing 730 kg in a 0.67 m. The system can run 3 simultaneous instances of the 8-billion-parameter open-source IBM Granite-3.3-8b-instruct model. The system is scalable, modular, and reconfigurable, supporting various model sizes and context lengths, and is well suited for deploying agentic workflows for enterprise AI applications in existing (cloud or on-prem) data center environments. For example, the system can support 18 instances of a 3-billion-parameter model or a single instance of a 70-billion-parameter model. Large language models have become a pervasive form of computing, and while the current paradigm has been to push frontier models for all applications, it is becoming evident that "Faith in God-like large language models is waning" [1]. In fact, by continuing along this trajectory, global energy requirements for AI-focused data centers are projected to reach double-digit percentages of total electricity consumption by 2030, with individual facilities requiring up to 1 gigawatt or more of dedicated power, driving both infrastructure and cooling costs toward potentially unsustainable or unprofitable levels [2] [3]. However, for many business applications, frontier models containing trillions of parameters may prove less useful and cost-efficient than much smaller language models with only a tenth or even a hundredth as many parameters [4].
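A quick back-of-envelope check of the efficiency implied by the abstract's own figures (115 peta-ops, 30 kW, 288 cards across 18 servers); the derived per-watt and per-card numbers below are arithmetic consequences of those figures, not values quoted from the paper:

```python
# Efficiency figures derived from the numbers in the abstract:
# 115 peta-ops at INT4, 30 kW total power, 288 cards in 18 2U servers.
total_ops = 115e15          # ops/s at 4-bit integer precision
power_w = 30e3              # 30 kW total system power
servers = 18
cards = 288

ops_per_watt = total_ops / power_w      # ~3.8 tera-ops per watt
cards_per_server = cards / servers      # 16 cards per 2U server
ops_per_card = total_ops / cards        # ~0.4 peta-ops per card

print(f"{ops_per_watt / 1e12:.1f} TOPS/W, "
      f"{cards_per_server:.0f} cards/server, "
      f"{ops_per_card / 1e15:.2f} POPS/card")
```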


LM-Fix: Lightweight Bit-Flip Detection and Rapid Recovery Framework for Language Models

Tahmasivand, Ahmad, Zahran, Noureldin, Al-Sayouri, Saba, Fouda, Mohammed, Khasawneh, Khaled N.

arXiv.org Artificial Intelligence

Abstract--Bit-flip attacks threaten the reliability and security of Language Models (LMs) by altering internal parameters and compromising output integrity. Recent studies show that flipping only a few bits in model parameters can bypass safety mechanisms and jailbreak the model. Existing detection approaches for DNNs and CNNs are not suitable for LMs, as the massive number of parameters significantly increases the timing and memory overhead of software-based methods and the chip-area overhead of hardware-based methods. In this work, we present LM-Fix, a lightweight LM-driven detection and recovery framework that leverages the model's own capabilities to identify and recover from faults. Our method detects bit-flips by generating a single output token from a predefined test vector and auditing the output tensor of a target layer against stored reference data. The same mechanism enables rapid recovery without reloading the entire model. Experiments across various models show that LM-Fix detects more than 94% of single-bit flips and nearly 100% of multi-bit flips, with very low computational overhead (1%-7.7% at TVL = 200 across models). Recovery achieves a more than 100x speedup compared to a full-model reload, which is critical on edge devices. LM-Fix can handle bit-flips affecting any part of the model's computation, including memory, cache, and arithmetic operations. Evaluation against recent LM-specific bit-flip attacks confirms its robustness and practical value for real-world deployment.
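The detection mechanism described here can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation; in particular, the SHA-256 fingerprint is an assumed stand-in for whatever stored reference data LM-Fix actually audits against:

```python
import hashlib
import numpy as np

def tensor_fingerprint(t: np.ndarray) -> str:
    """Hash a layer's output tensor so it can be audited cheaply."""
    return hashlib.sha256(t.tobytes()).hexdigest()

rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 8)).astype(np.float32)
test_vector = rng.standard_normal(8).astype(np.float32)

# Offline: store the reference fingerprint of the target layer's
# output for the predefined test vector.
reference = tensor_fingerprint(test_vector @ weights)

# Simulate a single-bit flip in one weight's raw representation.
raw = weights.view(np.uint32).copy()
raw[0, 0] ^= 1 << 30               # flip a high exponent bit
corrupted = raw.view(np.float32)

# Online: recompute the same output and audit it against the reference.
detected = tensor_fingerprint(test_vector @ corrupted) != reference
print("bit-flip detected:", detected)
```

Because only one forward pass over a fixed test vector is hashed, the audit cost stays small relative to normal inference, which is the property the low overhead numbers above rely on.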


1091660f3dff84fd648efe31391c5524-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for insightful comments. Your recognition of our work is much appreciated. The longer they are kept, the higher the number of bit flips they will suffer from; this easily results in a high fault rate. With that said, we consider extending our protection approaches to lower bit widths as future work.


TrainVerify: Equivalence-Based Verification for Distributed LLM Training

Lu, Yunchi, Miao, Youshan, Tan, Cheng, Huang, Peng, Zhu, Yi, Zhang, Xian, Yang, Fan

arXiv.org Artificial Intelligence

Training large language models (LLMs) at scale requires parallel execution across thousands of devices, incurring enormous computational costs. Yet these costly distributed training runs are rarely verified, leaving them prone to silent errors and potentially wasting millions of GPU hours. We introduce TrainVerify, a system for verifiable distributed training of LLMs. Given a deep learning model's logical specification as the ground truth, TrainVerify formally verifies that a distributed parallel execution plan is mathematically equivalent to it. Direct verification is notoriously difficult due to the sheer scale of LLMs, which often involve billions of variables and highly intricate computation graphs. Therefore, TrainVerify introduces shape-reduction techniques and a stage-wise parallel verification algorithm that significantly reduce complexity while preserving formal correctness. TrainVerify scales to frontier LLMs, including the successful verification of the Llama3 (405B) and DeepSeek-V3 (671B) training plans.
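A toy numeric analogue of the equivalence being verified: a tensor-parallel execution plan (here, column-sharded weights) should reproduce the logical single-device matmul exactly. TrainVerify establishes this symbolically over entire training plans; the sketch below only checks one layer numerically:

```python
import numpy as np

rng = np.random.default_rng(1)

# Logical specification: a single dense layer, y = x @ W.
x = rng.standard_normal((4, 16))
W = rng.standard_normal((16, 32))
logical = x @ W

# Parallel plan: column-split W across 4 "devices", run the shards
# independently, then concatenate the partial outputs.
shards = np.split(W, 4, axis=1)
parallel = np.concatenate([x @ s for s in shards], axis=1)

# Equivalence check (numeric here; TrainVerify does this symbolically,
# so it covers all inputs rather than one random sample).
assert np.allclose(logical, parallel)
print("parallel plan matches logical spec")
```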



High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures

Fu, Xiang, Zhang, Xinpeng, Ma, Jixiang, Zhao, Peng, Lu, Shuai, Liu, Xu T.

arXiv.org Artificial Intelligence

Convolution is the core component of deep neural networks, and it is computationally intensive and time-consuming. Tensor data layouts significantly impact convolution operations in terms of memory access and computational efficiency. Yet there is still a lack of comprehensive performance characterization of data layouts on SIMD architectures with respect to convolution methods. This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8, and introduces a set of general optimization techniques for both direct and im2win convolutions. We compare the optimized im2win convolution with the direct convolution and PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines. The experiments demonstrate that the im2win convolution with the new NHWC layout achieves up to a 355% speedup over the NCHW layout. Our optimizations also significantly improve the performance of both im2win and direct convolutions, which achieve up to 95% and 94% of the machine's theoretical peak performance, respectively.


vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

Zheng, Size, Chen, Renze, Li, Meng, Ye, Zihao, Ceze, Luis, Liang, Yun

arXiv.org Artificial Intelligence

IoT devices based on microcontroller units (MCUs) provide ultra-low power consumption and ubiquitous computation for near-sensor deep learning models (DNNs). However, MCU memory is usually 2-3 orders of magnitude smaller than that of mobile devices, which makes it challenging to map DNNs onto MCUs. Previous work separates memory management from kernel implementation on MCUs and relies on coarse-grained memory management techniques, such as in-place update, to reduce memory consumption. In this paper, we propose to coordinate memory management and kernel optimization for DNN inference on MCUs to enable fine-grained memory management. The key idea is to virtualize the limited memory of the MCU as a large memory pool. Each kernel divides the memory pool into kernel-specific segments and handles segment load and store while computing DNN layers. Memory consumption can be reduced because, using this fine-grained segment-level memory control, we can overlap the memory footprints of different tensors without needing to materialize them at the same time. Following this idea, we implement vMCU for DNN inference on MCUs. Evaluation of single layers on ARM Cortex-M4 and Cortex-M7 processors shows that vMCU reduces RAM usage by 12.0% to 49.5% and energy consumption by 20.6% to 53.0% compared to state-of-the-art work. For full-DNN evaluation, vMCU reduces the memory bottleneck by 61.5%, enabling more models to be deployed on low-end MCUs.
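The segment-level idea can be illustrated with a toy in-place kernel: a single shared pool holds the tensor, and each segment is loaded, computed, and stored back without materializing a separate output tensor. This is a simplified sketch of the principle, not vMCU's actual kernel implementation:

```python
import numpy as np

POOL_BYTES = 64                                      # toy stand-in for MCU SRAM
pool = np.zeros(POOL_BYTES // 4, dtype=np.float32)   # one shared memory pool

def relu_in_pool(n_elems: int, seg: int) -> None:
    """Apply ReLU over `n_elems` values living in the pool, one
    `seg`-sized segment at a time, reusing the input's storage for
    the output instead of allocating a second tensor."""
    for start in range(0, n_elems, seg):
        view = pool[start:start + seg]
        np.maximum(view, 0.0, out=view)   # segment load / compute / store

rng = np.random.default_rng(2)
data = rng.standard_normal(12).astype(np.float32)
pool[:12] = data
relu_in_pool(12, seg=4)
print(np.array_equal(pool[:12], np.maximum(data, 0.0)))  # True
```

Because input and output footprints overlap segment by segment, peak memory stays at one tensor's worth plus a segment, which is the effect the paper's fine-grained control generalizes to non-pointwise layers.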


Transformer for Times Series: an Application to the S&P500

Brugiere, Pierre, Turinici, Gabriel

arXiv.org Machine Learning

Transformer models have been used extensively, with good results, in a wide range of machine learning applications, including Large Language Models and image generation. Here, we investigate the applicability of this approach to financial time series. We first describe the dataset construction for two prototypical situations: a mean-reverting synthetic Ornstein-Uhlenbeck process on the one hand and real S&P500 data on the other. Then we present the proposed Transformer architecture in detail, and finally we discuss some encouraging results. For the synthetic data we predict the next move rather accurately, and for the S&P500 we obtain some interesting results related to quadratic variation and volatility prediction.
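The dataset construction for the synthetic case can be sketched as follows; the Euler-Maruyama parameters and window length here are illustrative choices, not the paper's exact settings:

```python
import numpy as np

# Simulate a mean-reverting Ornstein-Uhlenbeck path via Euler-Maruyama:
#   dX = theta * (mu - X) dt + sigma dW
rng = np.random.default_rng(3)
theta, mu, sigma, dt, n = 2.0, 0.0, 0.3, 0.01, 1000
x = np.empty(n)
x[0] = 0.0
for t in range(n - 1):
    x[t + 1] = (x[t] + theta * (mu - x[t]) * dt
                + sigma * np.sqrt(dt) * rng.standard_normal())

# Sliding-window dataset: each sample is a window of past values, and
# the label is the sign of the next move (next-move prediction).
window = 32
X = np.stack([x[i:i + window] for i in range(n - window)])
y = np.sign(np.diff(x))[window - 1:]
print(X.shape, y.shape)   # (968, 32) (968,)
```

Mean reversion makes the next-move sign partly predictable from the window (the process is pulled back toward mu), which is why the synthetic case serves as a sanity check before the harder S&P500 data.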


Distributed Deep Learning Inference Acceleration using Seamless Collaboration in Edge Computing

Li, Nan, Iosifidis, Alexandros, Zhang, Qi

arXiv.org Artificial Intelligence

This paper studies inference acceleration using distributed convolutional neural networks (CNNs) in collaborative edge computing. To ensure inference accuracy when partitioning inference tasks, we account for the receptive field when performing segment-based partitioning. To maximize parallelization between the communication and computing processes, and thereby minimize the total inference time of a task, we design a novel task collaboration scheme, named HALP, in which the overlapping zones of the sub-tasks on secondary edge servers (ESs) are executed on the host ES. We further extend HALP to the multi-task scenario. Experimental results show that HALP accelerates CNN inference in VGG-16 by 1.7-2.0x for a single task and 1.7-1.8x for 4 tasks per batch on a GTX 1080 Ti and a Jetson AGX Xavier, outperforming the state-of-the-art work MoDNN. Moreover, we evaluate service reliability under time-variant channels, showing that HALP is an effective solution for ensuring high service reliability under strict service deadlines.
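The receptive-field consideration in segment-based partitioning can be shown in one dimension: for a kernel of size k, each segment needs k - 1 extra input samples of overlap, or the partitioned result diverges from the full convolution. A toy analogue, not HALP itself:

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Plain 'valid'-mode 1-D convolution (cross-correlation form)."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel
                     for i in range(len(signal) - k + 1)])

rng = np.random.default_rng(4)
signal = rng.standard_normal(20)
kernel = rng.standard_normal(3)
full = conv1d_valid(signal, kernel)     # 18 output samples

# Segment-based partitioning: the first segment keeps k - 1 extra
# input samples (its receptive-field overlap) so results stay exact.
k = len(kernel)
mid = 10
seg_a = conv1d_valid(signal[:mid + k - 1], kernel)   # outputs 0..9
seg_b = conv1d_valid(signal[mid:], kernel)           # outputs 10..17
print(np.allclose(np.concatenate([seg_a, seg_b]), full))  # True
```

HALP's refinement is to schedule where these overlapping zones execute (on the host ES rather than the secondary ESs) so that communication and computation overlap in time.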


Everything You Need to Know About Tensors - KDnuggets

#artificialintelligence

TensorFlow is the go-to library for many machine learning model developers. It provides the standard Keras API for users to build their own neural networks and is equally prevalent in research and commercial applications. A tensor is a multi-dimensional array of elements with a single data type. It has two key properties: its shape and its data type, such as float, integer, or string. TensorFlow also includes eager execution, in which code is evaluated step by step, making it easier to debug.
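The two key properties are easy to inspect in code; NumPy is used below because its arrays expose the same shape and dtype semantics that TensorFlow tensors do (`tensor.shape`, `tensor.dtype`):

```python
import numpy as np

# A tensor is characterized by its shape and a single data type.
t = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]], dtype=np.float32)

print(t.shape)   # (2, 3) -- 2 rows, 3 columns
print(t.dtype)   # float32 -- one data type for every element
print(t.ndim)    # 2 -- number of dimensions (the tensor's rank)
```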