Goto

Collaborating Authors

 clock frequency


NeuroCoreX: An Open-Source FPGA-Based Spiking Neural Network Emulator with On-Chip Learning

Gautam, Ashish, Date, Prasanna, Kulkarni, Shruti, Patton, Robert, Potok, Thomas

arXiv.org Artificial Intelligence

--Spiking Neural Networks (SNNs) are computational models inspired by the structure and dynamics of biological neuronal networks. Their event-driven nature enables them to achieve high energy efficiency, particularly when deployed on neuromorphic hardware platforms. Unlike conventional Artificial Neural Networks (ANNs), which primarily rely on layered architectures, SNNs naturally support a wide range of connectivity patterns, from traditional layered structures to small-world graphs characterized by locally dense and globally sparse connections. In this work, we introduce NeuroCoreX, an FPGA-based emulator designed for the flexible co-design and testing of SNNs. NeuroCoreX supports all-to-all connectivity, providing the capability to implement diverse network topologies without architectural restrictions. It features a biologically motivated local learning mechanism based on Spike-Timing-Dependent Plasticity (STDP). The neuron model implemented within NeuroCoreX is the Leaky Integrate-and-Fire (LIF) model, with current-based synapses facilitating spike integration and transmission . A Universal Asynchronous Receiver-Transmitter (UART) interface is provided for programming and configuring the network parameters, including neuron, synapse, and learning rule settings. NeuroCoreX is released as an open-source framework, aiming to accelerate research and development in energy-efficient, biologically inspired computing. One of the primary goals of neuromorphic computing is to emulate the structure and dynamics of biological neuronal networks, achieving both brain-like energy efficiency and high computational accuracy.


E4: Energy-Efficient DNN Inference for Edge Video Analytics Via Early-Exit and DVFS

Zhang, Ziyang, Zhao, Yang, Chang, Ming-Ching, Lin, Changyao, Liu, Jie

arXiv.org Artificial Intelligence

Deep neural network (DNN) models are increasingly popular in edge video analytic applications. However, the compute-intensive nature of DNN models pose challenges for energy-efficient inference on resource-constrained edge devices. Most existing solutions focus on optimizing DNN inference latency and accuracy, often overlooking energy efficiency. They also fail to account for the varying complexity of video frames, leading to sub-optimal performance in edge video analytics. In this paper, we propose an Energy-Efficient Early-Exit (E4) framework that enhances DNN inference efficiency for edge video analytics by integrating a novel early-exit mechanism with dynamic voltage and frequency scaling (DVFS) governors. It employs an attention-based cascade module to analyze video frame diversity and automatically determine optimal DNN exit points. Additionally, E4 features a just-in-time (JIT) profiler that uses coordinate descent search to co-optimize CPU and GPU clock frequencies for each layer before the DNN exit points. Extensive evaluations demonstrate that E4 outperforms current state-of-the-art methods, achieving up to 2.8x speedup and 26% average energy saving while maintaining high accuracy.


Shavette: Low Power Neural Network Acceleration via Algorithm-level Error Detection and Undervolting

Rinkinen, Mikael, Koskinen, Lauri, Silven, Olli, Safarpour, Mehdi

arXiv.org Artificial Intelligence

Reduced voltage operation is an effective technique for substantial energy efficiency improvement in digital circuits. This brief introduces a simple approach for enabling reduced voltage operation of Deep Neural Network (DNN) accelerators by mere software modifications. Conventional approaches for enabling reduced voltage operation e.g., Timing Error Detection (TED) systems, incur significant development costs and overheads, while not being applicable to the off-the-shelf components. Contrary to those, the solution proposed in this paper relies on algorithm-based error detection, and hence, is implemented with low development costs, does not require any circuit modifications, and is even applicable to commodity devices. By showcasing the solution through experimenting on popular DNNs, i.e., LeNet and VGG16, on a GPU platform, we demonstrate 18% to 25% energy saving with no accuracy loss of the models and negligible throughput compromise (< 3.9%), considering the overheads from integration of the error detection schemes into the DNN. The integration of presented algorithmic solution into the design is simpler when compared conventional TED based techniques that require extensive circuit-level modifications, cell library characterizations or special support from the design tools.


Embedded FPGA Developments in 130nm and 28nm CMOS for Machine Learning in Particle Detector Readout

Gonski, Julia, Gupta, Aseem, Jia, Haoyi, Kim, Hyunjoon, Rota, Lorenzo, Ruckman, Larry, Dragone, Angelo, Herbst, Ryan

arXiv.org Artificial Intelligence

Embedded field programmable gate array (eFPGA) technology allows the implementation of reconfigurable logic within the design of an application-specific integrated circuit (ASIC). This approach offers the low power and efficiency of an ASIC along with the ease of FPGA configuration, particularly beneficial for the use case of machine learning in the data pipeline of next-generation collider experiments. An open-source framework called "FABulous" was used to design eFPGAs using 130 nm and 28 nm CMOS technology nodes, which were subsequently fabricated and verified through testing. The capability of an eFPGA to act as a front-end readout chip was assessed using simulation of high energy particles passing through a silicon pixel sensor. A machine learning-based classifier, designed for reduction of sensor data at the source, was synthesized and configured onto the eFPGA. A successful proof-of-concept was demonstrated through reproduction of the expected algorithm result on the eFPGA with perfect accuracy. Further development of the eFPGA technology and its application to collider detector readout is discussed.


Quantized Context Based LIF Neurons for Recurrent Spiking Neural Networks in 45nm

Bezugam, Sai Sukruth, Wu, Yihao, Yoo, JaeBum, Strukov, Dmitri, Kim, Bongjin

arXiv.org Artificial Intelligence

In this study, we propose the first hardware implementation of a context-based recurrent spiking neural network (RSNN) emphasizing on integrating dual information streams within the neocortical pyramidal neurons specifically Context- Dependent Leaky Integrate and Fire (CLIF) neuron models, essential element in RSNN. We present a quantized version of the CLIF neuron (qCLIF), developed through a hardware-software codesign approach utilizing the sparse activity of RSNN. Implemented in a 45nm technology node, the qCLIF is compact (900um^2) and achieves a high accuracy of 90% despite 8 bit quantization on DVS gesture classification dataset. Our analysis spans a network configuration from 10 to 200 qCLIF neurons, supporting up to 82k synapses within a 1.86 mm^2 footprint, demonstrating scalability and efficiency


Fast Inner-Product Algorithms and Architectures for Deep Neural Network Accelerators

Pogue, Trevor E., Nicolici, Nicola

arXiv.org Artificial Intelligence

We introduce a new algorithm called the Free-pipeline Fast Inner Product (FFIP) and its hardware architecture that improve an under-explored fast inner-product algorithm (FIP) proposed by Winograd in 1968. Unlike the unrelated Winograd minimal filtering algorithms for convolutional layers, FIP is applicable to all machine learning (ML) model layers that can mainly decompose to matrix multiplication, including fully-connected, convolutional, recurrent, and attention/transformer layers. We implement FIP for the first time in an ML accelerator then present our FFIP algorithm and generalized architecture which inherently improve FIP's clock frequency and, as a consequence, throughput for a similar hardware cost. Finally, we contribute ML-specific optimizations for the FIP and FFIP algorithms and architectures. We show that FFIP can be seamlessly incorporated into traditional fixed-point systolic array ML accelerators to achieve the same throughput with half the number of multiply-accumulate (MAC) units, or it can double the maximum systolic array size that can fit onto devices with a fixed hardware budget. Our FFIP implementation for non-sparse ML models with 8 to 16-bit fixed-point inputs achieves higher throughput and compute efficiency than the best-in-class prior solutions on the same type of compute platform.


C3S Micro-architectural Enhancement: Spike Encoder Block and Relaxing Gamma Clock (Asynchronous)

Anand, Alok, Khokhlov, Ivan, Anand, Abhishek

arXiv.org Artificial Intelligence

The field of neuromorphic computing is rapidly evolving. As both biological accuracy and practical implementations are explored, existing architectures are modified and improved for both purposes. The Temporal Neural Network(TNN) style of architecture is a good basis for approximating biological neurons due to its use of timed pulses to encode data and a voltage-threshold-like system. Using the Temporal Neural Network cortical column C3S architecture design as a basis, this project seeks to augment the network's design. This project takes note of two ideas and presents their designs with the goal of improving existing cortical column architecture. One need in this field is for an encoder that could convert between common digital formats and timed neuronal spikes, as biologically accurate networks are temporal in nature. To this end, this project presents an encoder to translate between binary encoded values and timed spikes to be processed by the neural network. Another need is for the reduction of wasted processing time to idleness, caused by lengthy Gamma cycle processing bursts. To this end, this project presents a relaxation of Gamma cycles to allow for them to end arbitrarily early once the network has determined an output response. With the goal of contributing to the betterment of the field of neuromorphic computer architecture, designs for both a binary-to-spike encoder, as well as a Gamma cycle controller, are presented and evaluated for optimal design parameters, with overall system gain and performance.


ArrayFlex: A Systolic Array Architecture with Configurable Transparent Pipelining

Peltekis, C., Filippas, D., Dimitrakopoulos, G., Nicopoulos, C., Pnevmatikatos, D.

arXiv.org Artificial Intelligence

Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many deep learning applications. For maximum scalability, their computation should combine high performance and energy efficiency. In practice, the convolutions of each CNN layer are mapped to a matrix multiplication that includes all input features and kernels of each layer and is computed using a systolic array. In this work, we focus on the design of a systolic array with configurable pipeline with the goal to select an optimal pipeline configuration for each CNN layer. The proposed systolic array, called ArrayFlex, can operate in normal, or in shallow pipeline mode, thus balancing the execution time in cycles and the operating clock frequency. By selecting the appropriate pipeline configuration per CNN layer, ArrayFlex reduces the inference latency of state-of-the-art CNNs by 11%, on average, as compared to a traditional fixed-pipeline systolic array. Most importantly, this result is achieved while using 13%-23% less power, for the same applications, thus offering a combined energy-delay-product efficiency between 1.4x and 1.8x.


Implementing Neural Network-Based Equalizers in a Coherent Optical Transmission System Using Field-Programmable Gate Arrays

Freire, Pedro J., Srivallapanondh, Sasipim, Anderson, Michael, Spinnler, Bernhard, Bex, Thomas, Eriksson, Tobias A., Napoli, Antonio, Schairer, Wolfgang, Costa, Nelson, Blott, Michaela, Turitsyn, Sergei K., Prilepsky, Jaroslaw E.

arXiv.org Artificial Intelligence

In this work, we demonstrate the offline FPGA realization of both recurrent and feedforward neural network (NN)-based equalizers for nonlinearity compensation in coherent optical transmission systems. First, we present a realization pipeline showing the conversion of the models from Python libraries to the FPGA chip synthesis and implementation. Then, we review the main alternatives for the hardware implementation of nonlinear activation functions. The main results are divided into three parts: a performance comparison, an analysis of how activation functions are implemented, and a report on the complexity of the hardware. The performance in Q-factor is presented for the cases of bidirectional long-short-term memory coupled with convolutional NN (biLSTM + CNN) equalizer, CNN equalizer, and standard 1-StpS digital back-propagation (DBP) for the simulation and experiment propagation of a single channel dual-polarization (SC-DP) 16QAM at 34 GBd along 17x70km of LEAF. The biLSTM+CNN equalizer provides a similar result to DBP and a 1.7 dB Q-factor gain compared with the chromatic dispersion compensation baseline in the experimental dataset. After that, we assess the Q-factor and the impact of hardware utilization when approximating the activation functions of NN using Taylor series, piecewise linear, and look-up table (LUT) approximations. We also show how to mitigate the approximation errors with extra training and provide some insights into possible gradient problems in the LUT approximation. Finally, to evaluate the complexity of hardware implementation to achieve 200G and 400G throughput, fixed-point NN-based equalizers with approximated activation functions are developed and implemented in an FPGA.


Dynamic GPU Energy Optimization for Machine Learning Training Workloads

Wang, Farui, Zhang, Weizhe, Lai, Shichao, Hao, Meng, Wang, Zheng

arXiv.org Artificial Intelligence

GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models become increasingly larger, they require a longer time to train, leading to higher GPU energy consumption. This paper presents GPOEO, an online GPU energy optimization framework for machine learning training workloads. GPOEO dynamically determines the optimal energy configuration by employing novel techniques for online measurement, multi-objective prediction modeling, and search optimization. To characterize the target workload behavior, GPOEO utilizes GPU performance counters. To reduce the performance counter profiling overhead, it uses an analytical model to detect the training iteration change and only collects performance counter data when an iteration shift is detected. GPOEO employs multi-objective models based on gradient boosting and a local search algorithm to find a trade-off between execution time and energy consumption. We evaluate the GPOEO by applying it to 71 machine learning workloads from two AI benchmark suites running on an NVIDIA RTX3080Ti GPU. Compared with the NVIDIA default scheduling strategy, GPOEO delivers a mean energy saving of 16.2% with a modest average execution time increase of 5.1%.