fpga
Model Recovery at the Edge under Resource Constraints for Physical AI
Xu, Bin, Banerjee, Ayan, Gupta, Sandeep K. S.
Model Recovery (MR) enables safe, explainable decision making in mission-critical autonomous systems (MCAS) by learning governing dynamical equations, but its deployment on edge devices is hindered by the iterative nature of neural ordinary differential equations (NODEs), which are inefficient on FPGAs. Memory and energy consumption are the main concerns when applying MR on edge devices for real-time operation. We propose MERINDA, a novel FPGA-accelerated MR framework that replaces iterative solvers with a parallelizable neural architecture equivalent to NODEs. MERINDA achieves nearly 11x lower DRAM usage and 2.2x faster runtime compared to mobile GPUs. Experiments reveal an inverse relationship between memory and energy at fixed accuracy, highlighting MERINDA's suitability for resource-constrained, real-time MCAS.
- North America > United States > Arizona > Maricopa County > Tempe (0.04)
- Atlantic Ocean > North Atlantic Ocean > Hudson Bay (0.04)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.68)
- Energy (0.67)
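To make the "iterative nature" of NODE solvers concrete, here is a minimal sketch of a fixed-step forward Euler integrator. It is not MERINDA or the authors' solver; the function names and the test equation dx/dt = -x are assumptions chosen for illustration. The point is that every step depends on the previous state, which is what makes this kind of loop hard to parallelize on an FPGA.

```python
import math


def euler_integrate(f, x0, t0, t1, steps):
    """Integrate dx/dt = f(t, x) from t0 to t1 with fixed-step forward Euler."""
    h = (t1 - t0) / steps
    t, x = t0, x0
    for _ in range(steps):
        # Each update reads the state produced by the previous iteration,
        # so the steps form a sequential dependency chain.
        x = x + h * f(t, x)
        t = t + h
    return x


def decay(t, x):
    """Hypothetical test dynamics: exponential decay, dx/dt = -x."""
    return -x


if __name__ == "__main__":
    approx = euler_integrate(decay, 1.0, 0.0, 1.0, 1000)
    print("Euler estimate of x(1):", approx)
    print("Exact value exp(-1):  ", math.exp(-1.0))
```

A higher-order or adaptive solver changes the accuracy per step but keeps the same step-to-step dependency.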
hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware
Schulte, Jan-Frederik, Ramhorst, Benjamin, Sun, Chang, Mitrevski, Jovan, Ghielmetti, Nicolò, Lupi, Enrico, Danopoulos, Dimitrios, Loncar, Vladimir, Duarte, Javier, Burnette, David, Laatu, Lauri, Tzelepis, Stylianos, Axiotis, Konstantinos, Berthet, Quentin, Wang, Haoyan, White, Paul, Demirsoy, Suleyman, Colombo, Marco, Aarrestad, Thea, Summers, Sioni, Pierini, Maurizio, Di Guglielmo, Giuseppe, Ngadiuba, Jennifer, Campos, Javier, Hawks, Ben, Gandrakota, Abhijith, Fahim, Farah, Tran, Nhan, Constantinides, George, Que, Zhiqiang, Luk, Wayne, Tapper, Alexander, Hoang, Duc, Paladino, Noah, Harris, Philip, Lai, Bo-Cheng, Valentin, Manuel, Forelli, Ryan, Ogrenci, Seda, Gerlach, Lino, Flynn, Rian, Liu, Mia, Diaz, Daniel, Khoda, Elham, Quinnan, Melissa, Solares, Russell, Parajuli, Santosh, Neubauer, Mark, Herwig, Christian, Tsoi, Ho Fung, Rankin, Dylan, Hsu, Shih-Chieh, Hauck, Scott
We present hls4ml, a free and open-source platform that translates machine learning (ML) models from modern deep learning frameworks into high-level synthesis (HLS) code that can be integrated into full designs for field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). With its flexible and modular design, hls4ml supports a large number of deep learning frameworks and can target HLS compilers from several vendors, including Vitis HLS, Intel oneAPI and Catapult HLS. Together with a wider ecosystem for software-hardware co-design, hls4ml has enabled the acceleration of ML inference in a wide range of commercial and scientific applications where low latency, resource usage, and power consumption are critical. In this paper, we describe the structure and functionality of the hls4ml platform. The overarching design considerations for the generated HLS code are discussed, together with selected performance results.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Information Technology (1.00)
- Government > Regional Government > North America Government > United States Government (0.93)
- Health & Medicine > Therapeutic Area (0.92)
- Energy (0.67)
Beyond the GPU: The Strategic Role of FPGAs in the Next Wave of AI
AI acceleration has been dominated by GPUs, but the growing need for lower latency, energy efficiency, and fine-grained hardware control exposes the limits of fixed architectures. In this context, Field-Programmable Gate Arrays (FPGAs) emerge as a reconfigurable platform that allows mapping AI algorithms directly into device logic. Their ability to implement parallel pipelines for convolutions, attention mechanisms, and post-processing with deterministic timing and reduced power consumption makes them a strategic option for workloads that demand predictable performance and deep customization. Unlike CPUs and GPUs, whose architecture is immutable, an FPGA can be reconfigured in the field to adapt its physical structure to a specific model, integrate as a SoC with embedded processors, and run inference near the sensor without sending raw data to the cloud. This reduces latency and required bandwidth, improves privacy, and frees GPUs from specialized tasks in data centers. Partial reconfiguration and compilation flows from AI frameworks are shortening the path from prototype to deployment, enabling hardware-algorithm co-design.
JEDI-linear: Fast and Efficient Graph Neural Networks for Jet Tagging on FPGAs
Que, Zhiqiang, Sun, Chang, Paramesvaran, Sudarshan, Clement, Emyr, Karakoulaki, Katerina, Brown, Christopher, Laatu, Lauri, Cox, Arianna, Tapper, Alexander, Luk, Wayne, Spiropulu, Maria
Graph Neural Networks (GNNs), particularly Interaction Networks (INs), have shown exceptional performance for jet tagging at the CERN High-Luminosity Large Hadron Collider (HL-LHC). However, their computational complexity and irregular memory access patterns pose significant challenges for deployment on FPGAs in hardware trigger systems, where strict latency and resource constraints apply. In this work, we propose JEDI-linear, a novel GNN architecture with linear computational complexity that eliminates explicit pairwise interactions by leveraging shared transformations and global aggregation. To further enhance hardware efficiency, we introduce fine-grained quantization-aware training with per-parameter bitwidth optimization and employ multiplier-free multiply-accumulate operations via distributed arithmetic. Evaluation results show that our FPGA-based JEDI-linear achieves 3.7 to 11.5 times lower latency, up to 150 times lower initiation interval, and up to 6.2 times lower LUT usage compared to state-of-the-art GNN designs while also delivering higher model accuracy and eliminating the need for DSP blocks entirely. This is the first interaction-based GNN to achieve less than 60 ns latency and currently meets the requirements for use in the HL-LHC CMS Level-1 trigger system. This work advances the next-generation trigger systems by enabling accurate, scalable, and resource-efficient GNN inference in real-time environments. Our open-sourced templates will further support reproducibility and broader adoption across scientific applications.
- Europe > Ukraine > Volyn Oblast > Luts'k (0.04)
- North America > United States > California > Los Angeles County > Pasadena (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
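The general idea of trading explicit pairwise interactions for a shared per-node transformation plus one global aggregation can be sketched as follows. This is not the JEDI-linear architecture: it assumes a toy, purely linear interaction a*x_i + b*x_j on scalar node features, and both function names are hypothetical. Under that assumption the O(N^2) pairwise sum and the O(N) global-sum form give the same result.

```python
def pairwise_aggregate(xs, a, b):
    """O(N^2): for each node i, sum a*x_i + b*x_j over all other nodes j."""
    n = len(xs)
    out = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if j != i:
                total += a * xs[i] + b * xs[j]
        out.append(total)
    return out


def global_aggregate(xs, a, b):
    """O(N): the same quantity from one global sum shared by all nodes."""
    n = len(xs)
    s = sum(xs)  # computed once, reused for every node
    # sum over j != i of (a*x_i + b*x_j) = (n-1)*a*x_i + b*(s - x_i)
    return [(n - 1) * a * x + b * (s - x) for x in xs]


if __name__ == "__main__":
    features = [1.0, 2.0, 4.0]  # hypothetical node features
    print(pairwise_aggregate(features, 0.5, 0.25))
    print(global_aggregate(features, 0.5, 0.25))
```

The equivalence is exact only because the toy interaction is linear; a real network with nonlinear layers has to be designed so that the aggregation can still be shared.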
Knowledge is Overrated: A zero-knowledge machine learning and cryptographic hashing-based framework for verifiable, low latency inference at the LHC
Jawahar, Pratik, Doglioni, Caterina, Pierini, Maurizio
Low latency event-selection (trigger) algorithms are essential components of Large Hadron Collider (LHC) operation. Modern machine learning (ML) models have shown great offline performance as classifiers and could improve trigger performance, thereby improving downstream physics analyses. However, inference on such large models does not satisfy the 40 MHz online latency constraint at the LHC. In this work, we propose PHAZE, a novel framework built on cryptographic techniques like hashing and zero-knowledge machine learning (zkML) to achieve low latency inference, via a certifiable, early-exit mechanism from an arbitrarily large baseline model. We lay the foundations for such a framework to achieve nanosecond-order latency and discuss its inherent advantages, such as built-in anomaly detection, within the scope of LHC triggers, as well as its potential to enable a dynamic low-level trigger in the future.
Organization
The intermediate residual blocks have convolution layers with stride 2 for down-sampling. Specifications of reference devices used in the paper. Now, we describe the latency measurement pipeline for desktop GPUs, Jetson, server CPUs, and mobile phones. Spearman's rank correlation coefficient of collected latencies among 8 representative devices in ... As shown in Table A.3, the measured latencies on the same set of architectures ... Visualization of the 10 reference neural architectures we used for the NAS-Bench-201 search space. The NAS-Bench-201 architecture indices are 11982, 13479, 14451, 1462, 431, 55, 6196, 8636, 9, 9881, in order from top left to bottom right.
- Information Technology > Hardware (0.50)
- Semiconductors & Electronics (0.48)
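Spearman's rank correlation, mentioned above for comparing latencies across devices, can be computed as the Pearson correlation of the ranks. The sketch below is a plain-Python illustration, not the paper's evaluation code; the latency values are made up, and ties receive average ranks.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman correlation = Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


if __name__ == "__main__":
    # Hypothetical latencies (ms) of the same architectures on two devices.
    device_a = [1.2, 3.4, 2.2, 5.0]
    device_b = [10.0, 31.0, 19.5, 60.0]
    print(spearman(device_a, device_b))
```

A value near 1 means the two devices order the architectures the same way even if absolute latencies differ.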
The Role of Advanced Computer Architectures in Accelerating Artificial Intelligence Workloads
Amin, Shahid, Shah, Syed Pervez Hussnain
The remarkable progress in Artificial Intelligence (AI) is foundationally linked to a concurrent revolution in computer architecture. As AI models, particularly Deep Neural Networks (DNNs), have grown in complexity, their massive computational demands have pushed traditional architectures to their limits. This paper provides a structured review of this co-evolution, analyzing the architectural landscape designed to accelerate modern AI workloads. We explore the dominant architectural paradigms, namely Graphics Processing Units (GPUs), Application-Specific Integrated Circuits (ASICs), and Field-Programmable Gate Arrays (FPGAs), by breaking down their design philosophies, key features, and performance trade-offs. The core principles essential for performance and energy efficiency, including dataflow optimization, advanced memory hierarchies, sparsity, and quantization, are analyzed. Furthermore, this paper looks ahead to emerging technologies such as Processing-in-Memory (PIM) and neuromorphic computing, which may redefine future computation. By synthesizing architectural principles with quantitative performance data from industry-standard benchmarks, this survey presents a comprehensive picture of the AI accelerator landscape. We conclude that AI and computer architecture are in a symbiotic relationship, where hardware-software co-design is no longer an optimization but a necessity for future progress in computing.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Pakistan > Punjab > Lahore Division > Lahore (0.04)
- Information Technology (1.00)
- Semiconductors & Electronics (0.67)
FPGA or GPU? Analyzing comparative research for application-specific guidance
Purkayastha, Arnab A, Tharwani, Jay, Aggarwal, Shobhit
The growing complexity of computational workloads has amplified the need for efficient and specialized hardware accelerators. Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) have emerged as prominent solutions, each excelling in specific domains. Although there is substantial research comparing FPGAs and GPUs, most of the work focuses primarily on performance metrics, offering limited insight into the specific types of applications that each accelerator benefits the most. This paper aims to bridge this gap by synthesizing insights from various research articles to guide users in selecting the appropriate accelerator for domain-specific applications. By categorizing the reviewed studies and analyzing key performance metrics, this work highlights the strengths, limitations, and ideal use cases for FPGAs and GPUs. The findings offer actionable recommendations, helping researchers and practitioners navigate trade-offs in performance, energy efficiency, and programmability.
- North America > United States > Virginia (0.04)
- North America > United States > North Carolina > Mecklenburg County > Charlotte (0.04)
- North America > United States > Massachusetts > Hampden County > Springfield (0.04)
Sub-microsecond Transformers for Jet Tagging on FPGAs
Laatu, Lauri, Sun, Chang, Cox, Arianna, Gandrakota, Abhijith, Maier, Benedikt, Ngadiuba, Jennifer, Que, Zhiqiang, Luk, Wayne, Spiropulu, Maria, Tapper, Alexander
We present the first sub-microsecond transformer implementation on an FPGA achieving competitive performance for state-of-the-art high-energy physics benchmarks. Transformers have shown exceptional performance on multiple tasks in modern machine learning applications, including jet tagging at the CERN Large Hadron Collider (LHC). However, their computational complexity has until now prohibited their use in real-time applications such as the hardware trigger systems of collider experiments. In this work, we demonstrate the first application of transformers for jet tagging on FPGAs, achieving $\mathcal{O}(100)$ nanosecond latency with superior performance compared to alternative baseline models. We leverage high-granularity quantization and distributed arithmetic optimization to fit the entire transformer model on a single FPGA, achieving the required throughput and latency. Furthermore, we add multi-head attention and linear attention support to hls4ml, making our work accessible to the broader fast machine learning community. This work advances the next-generation trigger systems for the High Luminosity LHC, enabling the use of transformers for real-time applications in high-energy physics and beyond.
- North America > United States > California (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Ukraine > Volyn Oblast > Luts'k (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
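Distributed arithmetic, named above as a way to avoid hardware multipliers, can be illustrated in its classic textbook lookup-table form. The abstract does not say which variant the authors use, so this is only a sketch under that assumption: weights are fixed integers, inputs are unsigned integers of a known bit width, and the function names are hypothetical. The dot product is assembled from table lookups, shifts, and additions only.

```python
def build_lut(weights):
    """Precompute the sum of weights selected by every possible bit mask."""
    n = len(weights)
    lut = []
    for mask in range(1 << n):
        lut.append(sum(w for k, w in enumerate(weights) if (mask >> k) & 1))
    return lut


def da_dot(lut, xs, bits):
    """Dot product of fixed weights with unsigned inputs, without multiplying.

    For each bit position b, bit b of every input forms a mask that indexes
    the table; the looked-up partial sum is shifted left by b and accumulated.
    """
    acc = 0
    for b in range(bits):
        mask = 0
        for k, x in enumerate(xs):
            mask |= ((x >> b) & 1) << k
        acc += lut[mask] << b
    return acc


if __name__ == "__main__":
    weights = [3, -2, 5]   # hypothetical fixed integer weights
    inputs = [7, 4, 1]     # hypothetical unsigned 3-bit inputs
    lut = build_lut(weights)
    print(da_dot(lut, inputs, bits=3))
    print(sum(w * x for w, x in zip(weights, inputs)))  # reference result
```

The table grows as 2^N in the number of inputs, so practical designs split long dot products into small groups or use other multiplier-free formulations.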
StrikeWatch: Wrist-worn Gait Recognition with Compact Time-series Models on Low-power FPGAs
Ling, Tianheng, Qian, Chao, Zdankin, Peter, Weis, Torben, Schiele, Gregor
Running offers substantial health benefits, but improper gait patterns can lead to injuries, particularly without expert feedback. While prior gait analysis systems based on cameras, insoles, or body-mounted sensors have demonstrated effectiveness, they are often bulky and limited to offline, post-run analysis. Wrist-worn wearables offer a more practical and non-intrusive alternative, yet enabling real-time gait recognition on such devices remains challenging due to noisy Inertial Measurement Unit (IMU) signals, limited computing resources, and dependence on cloud connectivity. This paper introduces StrikeWatch, a compact wrist-worn system that performs entirely on-device, real-time gait recognition using IMU signals. As a case study, we target the detection of heel versus forefoot strikes to enable runners to self-correct harmful gait patterns through visual and auditory feedback during running. We propose four compact DL architectures (1D-CNN, 1D-SepCNN, LSTM, and Transformer) and optimize them for energy-efficient inference on two representative embedded Field-Programmable Gate Arrays (FPGAs): the AMD Spartan-7 XC7S15 and the Lattice iCE40UP5K. Using our custom-built hardware prototype, we collect a labeled dataset from outdoor running sessions and evaluate all models via a fully automated deployment pipeline. Our results reveal clear trade-offs between model complexity and hardware efficiency. Evaluated across 12 participants, 6-bit quantized 1D-SepCNN achieves the highest average F1 score of 0.847 while consuming just 0.350 µJ per inference with a latency of 0.140 ms on the iCE40UP5K running at 20 MHz. This configuration supports up to 13.6 days of continuous inference on a 320 mAh battery. Running is one of the most widely practiced sports worldwide, offering significant physical and mental benefits [1].
- Energy > Energy Storage (0.34)
- Health & Medicine > Public Health (0.34)
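To show what a "6-bit quantized" model implies for individual values, the sketch below applies symmetric uniform quantization with a single scale per tensor. The abstract does not specify the quantization scheme, so the scheme, the function names, and the sample values are all assumptions made for illustration.

```python
def quantize(values, bits):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = (1 << (bits - 1)) - 1          # 31 for 6 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, -0.3, 0.0, 0.42, 1.0]   # hypothetical weights
    codes, scale = quantize(weights, bits=6)
    restored = dequantize(codes, scale)
    print("codes:   ", codes)
    print("restored:", restored)
    print("max error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

With this scheme the reconstruction error of any in-range value is at most half the scale, which is the accuracy cost paid for the smaller integer arithmetic.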