CNN accelerator
Efficient Implementation of Convolution Functions on FPGA Using Parameterizable Blocks and Polynomial Approximations
Magalhães, Philippe, Fresse, Virginie, Suffran, Benoît, Alata, Olivier
Implementing convolutional neural networks (CNNs) on field-programmable gate arrays (FPGAs) has emerged as a promising alternative to GPUs, offering lower latency, greater power efficiency, and greater flexibility. However, this development remains complex due to the hardware knowledge required and the long synthesis, placement, and routing stages, which slow down design cycles and prevent rapid exploration of network configurations, making resource optimization under severe constraints particularly challenging. This paper proposes a library of configurable convolution blocks designed to optimize FPGA implementation and adapt to available resources. It also presents a methodological framework for developing mathematical models that predict FPGA resource utilization. The approach is validated by analyzing the correlations between block parameters and resource usage, followed by an evaluation of prediction error metrics. The results show that the designed blocks enable adaptation of convolution layers to hardware constraints, and that the models accurately predict resource consumption, providing a useful tool for FPGA selection and optimized CNN deployment.
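The modeling framework described above can be sketched as a simple regression fit. Everything below is a hypothetical illustration: the feature choice (a MAC count derived from kernel size and channel counts), the block configurations, and the LUT figures are invented for the example and are not taken from the paper.

```python
# Hypothetical sketch: fit a linear model that predicts FPGA resource
# usage (LUTs) from convolution-block parameters. The data points are
# made up for illustration, not measurements from the paper.

def macs(kernel, c_in, c_out):
    """Multiply-accumulate count of one convolution block (per output pixel)."""
    return kernel * kernel * c_in * c_out

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form, one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# (kernel, c_in, c_out) -> measured LUTs (illustrative numbers only)
configs = [(3, 8, 8), (3, 16, 16), (5, 8, 16), (5, 16, 32)]
measured_luts = [700, 2500, 2300, 8500]

a, b = fit_linear([macs(*c) for c in configs], measured_luts)

def predict_luts(kernel, c_in, c_out):
    """Predict LUT usage for an unseen convolution-block configuration."""
    return a * macs(kernel, c_in, c_out) + b
```

A practical model of this kind would be fit per resource type (LUTs, DSPs, BRAM) against synthesis reports collected for each block configuration.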
FPGA-Based Neural Network Accelerators for Space Applications: A Survey
Antunes, Pedro, Podobas, Artur
Space missions are becoming increasingly ambitious, necessitating high-performance onboard spacecraft computing systems. In response, field-programmable gate arrays (FPGAs) have garnered significant interest due to their flexibility, cost-effectiveness, and radiation tolerance potential. Concurrently, neural networks (NNs) are being recognized for their capability to execute space mission tasks such as autonomous operations, sensor data analysis, and data compression. This survey serves as a valuable resource for researchers aiming to implement FPGA-based NN accelerators in space applications. By analyzing existing literature, identifying trends and gaps, and proposing future research directions, this work highlights the potential of these accelerators to enhance onboard computing systems.
A 10.60 $\mu$W 150 GOPS Mixed-Bit-Width Sparse CNN Accelerator for Life-Threatening Ventricular Arrhythmia Detection
Qin, Yifan, Jia, Zhenge, Yan, Zheyu, Mok, Jay, Yung, Manto, Liu, Yu, Liu, Xuejiao, Wen, Wujie, Liang, Luhong, Cheng, Kwang-Ting Tim, Hu, X. Sharon, Shi, Yiyu
This paper proposes an ultra-low power, mixed-bit-width sparse convolutional neural network (CNN) accelerator to accelerate ventricular arrhythmia (VA) detection. The chip achieves 50% sparsity in a quantized 1D CNN using a sparse processing element (SPE) architecture. Measurements on a prototype chip fabricated in a TSMC 40 nm CMOS low-power (LP) process demonstrate that, for the VA classification task, it consumes 10.60 $\mu$W of power while achieving a performance of 150 GOPS and a diagnostic accuracy of 99.95%. The computation power density is only 0.57 $\mu$W/mm$^2$, which is 14.23X smaller than state-of-the-art works, making it highly suitable for implantable and wearable medical devices.
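The zero-skipping principle behind a sparse processing element can be illustrated in software. This toy 1D convolution is a generic sketch of the idea only: it models the skipping of multiplications for zero weights, not the chip's quantization, datapath, or SPE microarchitecture.

```python
def sparse_conv1d(signal, weights):
    """1D convolution (valid mode) that skips work for zero weights.

    Nonzero weights and their offsets are gathered once; each output
    then costs only as many multiplications as there are nonzeros,
    which is the saving a hardware SPE exploits.
    """
    nz = [(i, w) for i, w in enumerate(weights) if w != 0]
    out_len = len(signal) - len(weights) + 1
    return [sum(w * signal[t + i] for i, w in nz) for t in range(out_len)]
```

For example, `sparse_conv1d([1, 2, 3, 4], [1, 0, 2])` performs two multiplications per output instead of three, mirroring the 50% sparsity exploited by the chip.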
PENDRAM: Enabling High-Performance and Energy-Efficient Processing of Deep Neural Networks through a Generalized DRAM Data Mapping Policy
Putra, Rachmad Vidya Wicaksana, Hanif, Muhammad Abdullah, Shafique, Muhammad
Convolutional Neural Networks (CNNs), a prominent type of Deep Neural Networks (DNNs), have emerged as a state-of-the-art solution for solving machine learning tasks. To improve the performance and energy efficiency of CNN inference, the employment of specialized hardware accelerators is prevalent. However, CNN accelerators still face performance- and energy-efficiency challenges due to high off-chip memory (DRAM) access latency and energy, which are especially crucial for latency- and energy-constrained embedded applications. Moreover, different DRAM architectures have different profiles of access latency and energy, thus making it challenging to optimize them for high performance and energy-efficient CNN accelerators. To address this, we present PENDRAM, a novel design space exploration methodology that enables high-performance and energy-efficient CNN acceleration through a generalized DRAM data mapping policy. Specifically, it explores the impact of different DRAM data mapping policies and DRAM architectures across different CNN partitioning and scheduling schemes on the DRAM access latency and energy, then identifies the Pareto-optimal design choices. The experimental results show that our DRAM data mapping policy improves the energy-delay-product of DRAM accesses in the CNN accelerator over other mapping policies by up to 96%. In this manner, our PENDRAM methodology offers high-performance and energy-efficient CNN acceleration under any given DRAM architecture for diverse embedded AI applications.
SAfEPaTh: A System-Level Approach for Efficient Power and Thermal Estimation of Convolutional Neural Network Accelerator
Chen, Yukai, Yang, Simei, Bhattacharjee, Debjyoti, Catthoor, Francky, Mallik, Arindam
The design of energy-efficient, high-performance, and reliable Convolutional Neural Network (CNN) accelerators involves significant challenges due to complex power and thermal management issues. This paper introduces SAfEPaTh, a novel system-level approach for accurately estimating power and temperature in tile-based CNN accelerators. Unlike traditional methods, it eliminates the need for circuit-level simulations or on-chip measurements. Through rigorous simulation results using the ResNet18 model, we demonstrate SAfEPaTh's capability to accurately estimate power and temperature within 500 seconds, encompassing CNN model accelerator mapping exploration and detailed power and thermal estimations. Furthermore, SAfEPaTh's adaptability extends its utility across various CNN models and accelerator architectures, underscoring its broad applicability in the field. This study contributes significantly to the advancement of energy-efficient and reliable CNN accelerator designs, addressing critical challenges in dynamic power and thermal management. INTRODUCTION: Convolutional Neural Network (CNN) accelerators have become essential for many modern applications, such as autonomous driving, image and speech recognition, and natural language processing, all of which demand low power consumption, high performance, and high reliability [1]. However, as semiconductor scaling approaches its physical limits, managing dynamic power and thermal issues in accelerators executing CNN models poses increasing challenges.
SCONNA: A Stochastic Computing Based Optical Accelerator for Ultra-Fast, Energy-Efficient Inference of Integer-Quantized CNNs
Vatsavai, Sairam Sri, Karempudi, Venkata Sai Praneeth, Thakkar, Ishan, Salehi, Ahmad, Hastings, Todd
The acceleration of a CNN inference task uses convolution operations that are typically transformed into vector-dot-product (VDP) operations. Several photonic microring resonators (MRRs) based hardware architectures have been proposed to accelerate integer-quantized CNNs with remarkably higher throughput and energy efficiency compared to their electronic counterparts. However, the existing photonic MRR-based analog accelerators exhibit a very strong trade-off between the achievable input/weight precision and VDP operation size, which severely restricts their achievable VDP operation size for the quantized input/weight precision of 4 bits and higher. The restricted VDP operation size ultimately suppresses computing throughput to severely diminish the achievable performance benefits. To address this shortcoming, we for the first time present a merger of stochastic computing and MRR-based CNN accelerators. To leverage the innate precision flexibility of stochastic computing, we invent an MRR-based optical stochastic multiplier (OSM). We employ multiple OSMs in a cascaded manner using dense wavelength division multiplexing, to forge a novel Stochastic Computing based Optical Neural Network Accelerator (SCONNA). SCONNA achieves significantly high throughput and energy efficiency for accelerating inferences of high-precision quantized CNNs. Our evaluation for the inference of four modern CNNs at 8-bit input/weight precision indicates that SCONNA provides improvements of up to 66.5x, 90x, and 91x in frames-per-second (FPS), FPS/W and FPS/W/mm$^2$, respectively, on average over two photonic MRR-based analog CNN accelerators from prior work, with Top-1 accuracy drop of only up to 0.4% for large CNNs and up to 1.5% for small CNNs. We developed a transaction-level, event-driven Python-based simulator for the evaluation of SCONNA and other accelerators (https://github.com/uky-UCAT/SC_ONN_SIM.git).
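The core arithmetic that SCONNA realizes optically, unipolar stochastic multiplication, can be sketched in plain software. This is a generic stochastic-computing illustration, not a model of the OSM's photonic implementation; the stream length and seed are arbitrary choices for the example.

```python
import random

def to_bitstream(p, length, rng):
    """Encode a value p in [0, 1] as a unipolar stochastic bitstream,
    where each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def stochastic_multiply(a, b, length=10_000, seed=0):
    """Multiply two values in [0, 1] by AND-ing independent bitstreams.

    In unipolar stochastic computing, the bitwise AND of two independent
    streams with P(bit=1) = a and P(bit=1) = b yields a stream with
    P(bit=1) = a * b; counting ones recovers the product.
    """
    rng = random.Random(seed)
    sa = to_bitstream(a, length, rng)
    sb = to_bitstream(b, length, rng)
    product_stream = [x & y for x, y in zip(sa, sb)]
    return sum(product_stream) / length
```

Longer bitstreams reduce the variance of the estimate at the cost of latency, which is the precision/throughput knob that gives stochastic computing its "innate precision flexibility."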
Photonic Reconfigurable Accelerators for Efficient Inference of CNNs with Mixed-Sized Tensors
Vatsavai, Sairam Sri, Thakkar, Ishan G
Photonic Microring Resonator (MRR) based hardware accelerators have been shown to provide disruptive speedup and energy-efficiency improvements for processing deep Convolutional Neural Networks (CNNs). However, previous MRR-based CNN accelerators fail to provide efficient adaptability for CNNs with mixed-sized tensors. One example of such CNNs is depthwise separable CNNs. Performing inferences of CNNs with mixed-sized tensors on such inflexible accelerators often leads to low hardware utilization, which diminishes the achievable performance and energy efficiency from the accelerators. In this paper, we present a novel way of introducing reconfigurability in the MRR-based CNN accelerators, to enable dynamic maximization of the size compatibility between the accelerator hardware components and the CNN tensors that are processed using the hardware components. We classify the state-of-the-art MRR-based CNN accelerators from prior works into two categories, based on the layout and relative placements of the utilized hardware components in the accelerators. We then use our method to introduce reconfigurability in accelerators from these two classes, to consequently improve their parallelism, the flexibility of efficiently mapping tensors of different sizes, speed, and overall energy efficiency. We evaluate our reconfigurable accelerators against three prior works for the area proportionate outlook (equal hardware area for all accelerators). Our evaluation for the inference of four modern CNNs indicates that our designed reconfigurable CNN accelerators provide improvements of up to 1.8x in Frames-Per-Second (FPS) and up to 1.5x in FPS/W, compared to an MRR-based accelerator from prior work.
More Efficient AI Training With A Systolic Neural CPU
Since modern machine learning came onto the scene, the push has been on to make workloads leveraging the technology as efficient as possible. Given what these applications are being asked to do – from being a cornerstone of the oncoming autonomous vehicle era to real-time analytics to helping enterprises become more user-friendly, more profitable, and more able to thwart cyberattacks – the speed and efficiency of such techniques as deep neural networks (DNNs) have been at the forefront. That has included leveraging the capabilities of accelerators like GPUs and FPGAs and introducing new devices, such as Amazon Web Services' Trainium devices for AI and machine learning training tasks in the cloud. However, such efforts have only addressed part of the issue, according to Yuhao Ju, a Ph.D. candidate at Northwestern University. What so many of the efforts to make DNN accelerators more efficient have missed is the need to improve the end-to-end performance of the deep learning tasks themselves, where the movement of data is still a significant factor in the overall execution time of the job. Speaking at the recent virtual ISSCC event, Ju said that for high-performing machine learning jobs, a CPU is also needed for pre- and post-processing and data preparation for the accelerator.
CoDR: Computation and Data Reuse Aware CNN Accelerator
Khadem, Alireza, Ye, Haojie, Mudge, Trevor
Computation and data reuse are critical for resource-limited Convolutional Neural Network (CNN) accelerators. This paper presents Universal Computation Reuse to exploit weight sparsity, repetition, and similarity simultaneously in a convolutional layer. Moreover, CoDR decreases the cost of weight memory access with a customized Run-Length Encoding scheme, and reduces the number of memory accesses to intermediate results by introducing an input- and output-stationary dataflow. Compared to two recent compressed CNN accelerators with the same area of 2.85 mm^2, CoDR decreases SRAM access by 5.08x and 7.99x, and consumes 3.76x and 6.84x less energy.
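The run-length idea behind sparse-weight compression can be sketched as follows. The (zero-run, value) pair encoding below is a generic illustration of the principle, not CoDR's actual customized format.

```python
def rle_encode(weights):
    """Run-length encode a flattened weight vector, compressing zero runs.

    Each nonzero weight is stored as a (preceding_zero_count, value)
    pair; the trailing zero count is returned separately so decoding
    is exact.
    """
    encoded, zeros = [], 0
    for w in weights:
        if w == 0:
            zeros += 1
        else:
            encoded.append((zeros, w))
            zeros = 0
    return encoded, zeros

def rle_decode(encoded, trailing_zeros):
    """Reconstruct the original weight vector from the RLE form."""
    out = []
    for run, value in encoded:
        out.extend([0] * run)
        out.append(value)
    out.extend([0] * trailing_zeros)
    return out
```

With 50%+ sparsity, storing only the nonzero weights plus short run counts is what shrinks weight-memory traffic in an accelerator of this kind.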
ROMANet: Fine-Grained Reuse-Driven Off-Chip Memory Access Management and Data Organization for Deep Neural Network Accelerators
Putra, Rachmad Vidya Wicaksana, Hanif, Muhammad Abdullah, Shafique, Muhammad
Enabling high energy efficiency is crucial for embedded implementations of deep learning. Several studies have shown that the DRAM-based off-chip memory accesses are one of the most energy-consuming operations in deep neural network (DNN) accelerators, and thereby limit the designs from achieving efficiency gains at the full potential. DRAM access energy varies depending upon the number of accesses required as well as the energy consumed per-access. Therefore, searching for a solution towards the minimum DRAM access energy is an important optimization problem. Towards this, we propose the ROMANet methodology that aims at reducing the number of memory accesses, by searching for the appropriate data partitioning and scheduling for each layer of a network using a design space exploration, based on the knowledge of the available on-chip memory and the data reuse factors. Moreover, ROMANet also targets decreasing the number of DRAM row buffer conflicts and misses, by exploiting the DRAM multi-bank burst feature to improve the energy-per-access. Besides providing the energy benefits, our proposed DRAM data mapping also results in an increased effective DRAM throughput, which is useful for latency-constraint scenarios. Our experimental results show that the ROMANet saves DRAM access energy by 12% for the AlexNet, by 36% for the VGG-16, and by 46% for the MobileNet, while also improving the DRAM throughput by 10%, as compared to the state-of-the-art.
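The kind of cost model such a design-space exploration evaluates can be sketched as a rough DRAM-access count under tiling. This is a simplified illustration with assumed refetch behavior and fixed 3x3 kernels; a real model like ROMANet's also accounts for row-buffer conflicts and multi-bank bursts, which are omitted here.

```python
from math import ceil

def dram_accesses(layer, tile, bytes_per_elem=1):
    """Rough DRAM-transfer count for one 3x3 conv layer under tiling.

    Assumed (simplified) reuse pattern: the input feature map is
    refetched once per output-channel tile group, weights are refetched
    once per spatial tile, and outputs are written exactly once.
    """
    H, W, C, K = layer["H"], layer["W"], layer["C"], layer["K"]
    n_h = ceil(H / tile["th"])          # spatial tiles along height
    n_w = ceil(W / tile["tw"])          # spatial tiles along width
    n_k = ceil(K / tile["tk"])          # output-channel tile groups
    ifmap = n_k * H * W * C             # input refetches
    weights = n_h * n_w * C * K * 9     # 3x3 kernels assumed
    ofmap = H * W * K                   # outputs written once
    return (ifmap + weights + ofmap) * bytes_per_elem
```

Sweeping the tile sizes and comparing such counts across candidate partitionings is the essence of the design-space exploration; smaller tiles cut on-chip buffer needs but inflate refetch traffic.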