crossbar
MDM: Manhattan Distance Mapping of DNN Weights for Parasitic-Resistance-Resilient Memristive Crossbars
Farias, Matheus, Martins, Wanghley, Kung, H. T.
Manhattan Distance Mapping (MDM) is a post-training deep neural network (DNN) weight mapping technique for memristive bit-sliced compute-in-memory (CIM) crossbars that reduces parasitic resistance (PR) nonidealities. PR limits crossbar efficiency by forcing DNN matrices to be mapped into small crossbar tiles, which reduces CIM-based speedup. Each crossbar executes one tile, requiring digital synchronization before the next layer. At this granularity, designers either deploy many small crossbars in parallel or reuse a few sequentially, both of which increase analog-to-digital conversions, latency, I/O pressure, and chip area. MDM alleviates PR effects by optimizing active-memristor placement. Exploiting bit-level structured sparsity, it feeds activations from the denser low-order side and reorders rows according to the Manhattan distance, relocating active cells toward regions less affected by PR and thus lowering the nonideality factor (NF). Applied to DNN models on ImageNet-1k, MDM reduces NF by up to 46% and improves accuracy under analog distortion by an average of 3.6% in ResNets. Overall, it provides a lightweight, spatially informed method for scaling CIM DNN accelerators.
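As a rough illustration of the row-reordering idea, the sketch below uses a cost model that is an assumption, not the paper's NF definition: each active cell at row i, column j contributes its Manhattan distance i + j from the corner (0, 0), taken here as the region least affected by PR. Under that assumption, moving a row only changes the row term of the cost, so placing the rows with the most active cells first minimizes the total. The names `manhattan_cost` and `reorder_rows` are hypothetical.

```python
def manhattan_cost(tile):
    """Sum of (row + column) over active cells; (0, 0) is the assumed low-PR corner."""
    return sum(i + j
               for i, row in enumerate(tile)
               for j, bit in enumerate(row) if bit)


def reorder_rows(tile):
    """Return (permutation, reordered tile) with the densest rows nearest row 0.

    Row order only changes the row term of the cost, i * (active cells in row),
    so placing rows in descending order of active-cell count minimizes it.
    """
    order = sorted(range(len(tile)), key=lambda r: -sum(tile[r]))
    return order, [tile[r] for r in order]


# Toy 4x4 bit-sliced tile: 1 = active memristor, 0 = inactive.
tile = [
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]
order, mapped = reorder_rows(tile)
print(order, manhattan_cost(tile), manhattan_cost(mapped))
```

The input activations would have to be permuted with the same `order` so that the matrix-vector product is unchanged.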
OpenMENA: An Open-Source Memristor Interfacing and Compute Board for Neuromorphic Edge-AI Applications
Safa, Ali, Mohsen, Farida, Ali, Zainab, Wang, Bo, Bermak, Amine
Memristive crossbars enable in-memory multiply-accumulate and local plasticity learning, offering a path to energy-efficient edge AI. To this end, we present Open-MENA (Open Memristor-in-Memory Accelerator), which, to our knowledge, is the first fully open memristor interfacing system integrating (i) a reproducible hardware interface for memristor crossbars with mixed-signal read-program-verify loops; (ii) a firmware-software stack with high-level APIs for inference and on-device learning; and (iii) a Voltage-Incremental Proportional-Integral (VIPI) method to program pre-trained weights into analog conductances, followed by chip-in-the-loop fine-tuning to mitigate device non-idealities. OpenMENA is validated on digit recognition, demonstrating the flow from weight transfer to on-device adaptation, and on a real-world robot obstacle-avoidance task, where the memristor-based model learns to map localization inputs to motor commands. OpenMENA is released as open source to democratize memristor-enabled edge-AI research. We release all hardware design and software material as open source at: https://tinyurl.com/mr592wuf
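To make the read-program-verify idea concrete, here is a minimal sketch of a generic proportional-integral programming loop. It is not the paper's VIPI algorithm or the OpenMENA API: the `ToyMemristor` class, its linear response, the gains, and the voltage limit are all assumptions chosen for illustration.

```python
class ToyMemristor:
    """Hypothetical device: conductance moves linearly with pulse voltage."""

    def __init__(self, g=0.1, response=0.5):
        self.g = g                # normalized conductance in [0, 1]
        self.response = response  # conductance change per volt (assumed)

    def read(self):
        return self.g

    def program(self, volts):
        self.g = min(1.0, max(0.0, self.g + self.response * volts))


def program_verify(device, target, kp=1.0, ki=0.1, tol=0.005, max_pulses=200):
    """Pulse the device until its read-back conductance is within tol of target."""
    integral = 0.0
    for pulse in range(max_pulses):
        error = target - device.read()       # verify step
        if abs(error) <= tol:
            return pulse                      # number of pulses applied so far
        integral += error
        volts = kp * error + ki * integral    # PI control law
        volts = max(-1.0, min(1.0, volts))    # stay within an assumed safe range
        device.program(volts)                 # program step
    raise RuntimeError("did not converge")


cell = ToyMemristor()
pulses = program_verify(cell, target=0.8)
print(pulses, cell.read())
```

A real device responds nonlinearly and with cycle-to-cycle variation, which is why the verify read after every pulse matters.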
Efficient and Encrypted Inference using Binarized Neural Networks within In-Memory Computing Architectures
Rajendran, Gokulnath, Deb, Suman, Chattopadhyay, Anupam
Binarized Neural Networks (BNNs) are a class of deep neural networks designed to utilize minimal computational resources, which drives their popularity across various applications. Recent studies highlight the potential of mapping BNN model parameters onto emerging non-volatile memory technologies, specifically using crossbar architectures, resulting in improved inference performance compared to traditional CMOS implementations. However, the common practice of protecting model parameters from theft attacks by storing them in an encrypted format and decrypting them at runtime introduces significant computational overhead, thus undermining the core principles of in-memory computing, which aim to integrate computation and storage. This paper presents a robust strategy for protecting BNN model parameters, particularly within in-memory computing frameworks. Our method utilizes a secret key derived from a physical unclonable function to transform model parameters prior to storage in the crossbar. Subsequently, the inference operations are performed on the encrypted weights, achieving a very special case of Fully Homomorphic Encryption (FHE) with minimal runtime overhead. Our analysis reveals that inference conducted without the secret key results in drastically diminished performance, with accuracy falling below 15%. These results validate the effectiveness of our protection strategy in securing BNNs within in-memory computing architectures while preserving computational efficiency.
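One simple way computation on transformed weights can work for values in {-1, +1} is sign masking, sketched below. This is an illustrative assumption, not necessarily the paper's construction: a key of ±1 values (standing in for a PUF-derived key) flips the stored weights, and the key holder applies the same mask to the inputs, so the masks cancel inside the dot product.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mask(vector, key):
    """Flip the sign of each element where the key is -1."""
    return [v * k for v, k in zip(vector, key)]


random.seed(0)
n = 64
key = [random.choice((-1, 1)) for _ in range(n)]      # stand-in for a PUF-derived key
weights = [random.choice((-1, 1)) for _ in range(n)]  # one binarized weight row
inputs = [random.choice((-1, 1)) for _ in range(n)]   # binarized activations

stored = mask(weights, key)                  # what would sit in the crossbar

with_key = dot(stored, mask(inputs, key))    # key holder masks the inputs
without_key = dot(stored, inputs)            # attacker uses stored weights directly
print(dot(weights, inputs), with_key, without_key)
```

Because every key element squares to 1, `with_key` equals the plaintext dot product and no runtime decryption of the stored weights is needed.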
Multi-Objective Optimization of ReRAM Crossbars for Robust DNN Inferencing under Stochastic Noise
Yang, Xiaoxuan, Belakaria, Syrine, Joardar, Biresh Kumar, Yang, Huanrui, Doppa, Janardhan Rao, Pande, Partha Pratim, Chakrabarty, Krishnendu, Li, Hai
Resistive random-access memory (ReRAM) is a promising technology for designing hardware accelerators for deep neural network (DNN) inferencing. We propose the design and optimization of a high-performance, area- and energy-efficient ReRAM-based hardware accelerator to achieve robust DNN inferencing in the presence of stochastic noise. We make two key technical contributions. First, we propose a stochastic-noise-aware training method, referred to as ReSNA, to improve the accuracy of DNN inferencing on ReRAM crossbars with stochastic noise. Second, we propose an information-theoretic algorithm, referred to as CF-MESMO, to identify the Pareto set of solutions to trade-off multiple objectives, including inferencing accuracy, area overhead, execution time, and energy consumption. The main challenge in this context is that executing the ReSNA method to evaluate each candidate ReRAM design is prohibitive. To address this challenge, we utilize the continuous-fidelity evaluation of ReRAM designs associated with prohibitive high computation cost by varying the number of training epochs to trade-off accuracy and cost. CF-MESMO iteratively selects the candidate ReRAM design and fidelity pair that maximizes the information gained per unit computation cost about the optimal Pareto front. Our experiments on benchmark DNNs show that the proposed algorithms efficiently uncover high-quality Pareto fronts. On average, ReSNA achieves 2.57% inferencing accuracy improvement for ResNet20 on the CIFAR-10 dataset with respect to the baseline configuration. Moreover, CF-MESMO algorithm achieves 90.
Resistive random access memory (ReRAM) has emerged as a promising nonvolatile memory technology due to its multi-level cell, small cell size, and low access time and energy consumption. Prior work has shown that the crossbar structure of ReRAM arrays can efficiently execute matrix-vector multiplication [1], [2], the predominant computational kernel associated with deep neural networks (DNNs).
ReRAM-based accelerators for fast and efficient DNN training and inferencing have been extensively studied [3]-[8]. However, a key challenge in executing DNN inferencing [9]-[11] on ReRAM-based architecture arises due to nonidealities of ReRAM devices, which can degrade the accuracy of inferencing.
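For readers unfamiliar with the term, the sketch below shows what a Pareto set is for the four objectives named in the abstract. It is a brute-force non-dominated filter, not CF-MESMO, and the design names and objective values are made up. Accuracy is expressed as an error rate so that every objective is minimized.

```python
def dominates(a, b):
    """True if design a is no worse than b on every objective and better on one.

    All objectives are expressed so that smaller is better.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_set(designs):
    """Return the names of designs that no other design dominates."""
    return [name for name, obj in designs.items()
            if not any(dominates(other, obj)
                       for other_name, other in designs.items()
                       if other_name != name)]


# Made-up objective tuples: (error rate, area, latency, energy), all minimized.
designs = {
    "A": (0.08, 1.0, 2.0, 3.0),
    "B": (0.10, 0.8, 2.5, 2.0),
    "C": (0.10, 1.2, 2.5, 3.5),   # dominated by A and by B
    "D": (0.06, 1.5, 1.8, 4.0),
}
front = pareto_set(designs)
print(front)
```

The filter needs every candidate to be fully evaluated first, which is exactly the cost the abstract describes as prohibitive and which motivates choosing candidates and fidelities selectively.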
Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated Approach
Rossenbach, Nick, Hilmes, Benedikt, Brackmann, Leon, Gunz, Moritz, Schlüter, Ralf
Memristor-based hardware offers new possibilities for energy-efficient machine learning (ML) by providing analog in-memory matrix multiplication. Current hardware prototypes cannot fit large neural networks, and related literature covers only small ML models for tasks like MNIST or single word recognition. Simulation can be used to explore how hardware properties affect larger models, but existing software assumes simplified hardware. We propose a PyTorch-based library based on "Synaptogen" to simulate neural network execution with accurately captured memristor hardware properties. For the first time, we show how an ML system with millions of parameters would behave on memristor hardware, using a Conformer trained on the speech recognition task TED-LIUMv2 as example. With adjusted quantization-aware training, we limit the relative degradation in word error rate to 25% when using a 3-bit weight precision to execute linear operations via simulated analog computation.
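The sketch below gives a sense of what executing a linear operation with 3-bit weights under analog noise involves. It does not use the proposed library or Synaptogen; the symmetric uniform quantizer and the additive Gaussian noise on each stored code are placeholder assumptions, far simpler than an accurately captured device model.

```python
import random


def quantize(weights, bits=3):
    """Symmetric uniform quantization; returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1                     # 3 for 3-bit weights
    scale = max(abs(w) for w in weights) / qmax
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def noisy_dot(codes, scale, inputs, sigma, rng):
    """Dot product with Gaussian noise added to each stored code (toy model)."""
    return scale * sum((c + rng.gauss(0.0, sigma)) * x
                       for c, x in zip(codes, inputs))


rng = random.Random(0)
weights = [rng.uniform(-1.0, 1.0) for _ in range(16)]
inputs = [rng.uniform(-1.0, 1.0) for _ in range(16)]

codes, scale = quantize(weights)
ideal = sum(w * x for w, x in zip(weights, inputs))
quantized = noisy_dot(codes, scale, inputs, sigma=0.0, rng=rng)
analog = noisy_dot(codes, scale, inputs, sigma=0.1, rng=rng)
print(ideal, quantized, analog)
```

Comparing `ideal`, `quantized`, and `analog` separates the error caused by quantization from the error caused by the simulated device noise.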
Optimizing Binary and Ternary Neural Network Inference on RRAM Crossbars using CIM-Explorer
Pelke, Rebecca, Cubero-Cascante, José, Bosbach, Nils, Degener, Niklas, Idrizi, Florian, Reimann, Lennart M., Joseph, Jan Moritz, Leupers, Rainer
Using Resistive Random Access Memory (RRAM) crossbars in Computing-in-Memory (CIM) architectures offers a promising solution to overcome the von Neumann bottleneck. Due to non-idealities like cell variability, RRAM crossbars are often operated in binary mode, utilizing only two states: Low Resistive State (LRS) and High Resistive State (HRS). Binary Neural Networks (BNNs) and Ternary Neural Networks (TNNs) are well-suited for this hardware due to their efficient mapping. Existing software projects for RRAM-based CIM typically focus on only one aspect: compilation, simulation, or Design Space Exploration (DSE). Moreover, they often rely on classical 8-bit quantization. To address these limitations, we introduce CIM-Explorer, a modular toolkit for optimizing BNN and TNN inference on RRAM crossbars. CIM-Explorer includes an end-to-end compiler stack, multiple mapping options, and simulators, enabling a DSE flow for accuracy estimation across different crossbar parameters and mappings. CIM-Explorer can accompany the entire design process, from early accuracy estimation for specific crossbar parameters, to selecting an appropriate mapping, and compiling BNNs and TNNs for a finalized crossbar chip. In DSE case studies, we demonstrate the expected accuracy for various mappings and crossbar parameters. CIM-Explorer can be found on GitHub.
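As an example of why ternary weights map naturally onto binary-mode cells, the sketch below uses a differential mapping in which each weight occupies two cells and the column output is the difference of two currents. This is one common scheme, shown as an assumption; it is not taken from CIM-Explorer, and the conductance values are illustrative.

```python
G_LRS = 100e-6   # assumed low-resistive-state conductance in siemens
G_HRS = 1e-6     # assumed high-resistive-state conductance in siemens


def map_ternary(weight):
    """Map a weight in {-1, 0, +1} to a (positive, negative) pair of binary cells."""
    if weight == 1:
        return (G_LRS, G_HRS)
    if weight == -1:
        return (G_HRS, G_LRS)
    return (G_HRS, G_HRS)


def column_output(weights, voltages):
    """Differential column current, rescaled to the ideal dot product."""
    cells = [map_ternary(w) for w in weights]
    i_pos = sum(v * g_pos for v, (g_pos, _) in zip(voltages, cells))
    i_neg = sum(v * g_neg for v, (_, g_neg) in zip(voltages, cells))
    return (i_pos - i_neg) / (G_LRS - G_HRS)


weights = [1, -1, 0, 1, -1]
voltages = [0.2, 0.1, 0.3, 0.0, 0.2]   # input activations encoded as read voltages
result = column_output(weights, voltages)
print(result)
```

With ideal cells the HRS leakage cancels in the subtraction; cell variability breaks that cancellation, which is the kind of effect a design space exploration would quantify.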
Low-power Spike-based Wearable Analytics on RRAM Crossbars
Bhattacharjee, Abhiroop, Shi, Jinquan, Chen, Wei-Chen, Wang, Xinxin, Panda, Priyadarshini
This work introduces a spike-based wearable analytics system utilizing Spiking Neural Networks (SNNs) deployed on an in-memory computing engine based on RRAM crossbars, which are known for their compactness and energy efficiency. Given the hardware constraints and noise characteristics of the underlying RRAM crossbars, we propose online adaptation of pre-trained SNNs in real time using Direct Feedback Alignment (DFA) instead of traditional backpropagation (BP). DFA learning, which allows layer-parallel gradient computations, acts as a fast, energy- and area-efficient method for online adaptation of SNNs on RRAM crossbars, delivering better algorithmic performance than SNNs adapted using BP. Through extensive simulations using our in-house hardware evaluation engine called DFA Sim, we find that DFA achieves up to 64.1% lower energy consumption, 10.1% lower area overhead, and a 2.1x reduction in latency compared to BP, while delivering up to 7.55% higher inference accuracy on human activity recognition (HAR) tasks.
Figure 1: Pictorial depiction of SNNs used in wearables for temporal activity recognition (HAR) tasks. Pre-trained SNNs in the cloud are adapted online according to the constraints of resource-constrained edge devices.
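The core difference between DFA and BP can be shown in a few lines. The sketch below is a generic, non-spiking illustration with a tanh hidden layer, not the paper's SNN training code or its simulator: the hidden-layer error signal comes from projecting the output error through a fixed random feedback matrix, so it does not depend on the weights of later layers. All names and values are hypothetical.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dfa_hidden_delta(feedback, output_error, hidden):
    """Hidden-layer error signal under DFA for a tanh hidden layer.

    The output error is projected through a fixed random matrix `feedback`
    (hidden units x outputs) rather than the transposed forward weights
    that backpropagation would use.
    """
    projected = matvec(feedback, output_error)
    return [p * (1.0 - h * h) for p, h in zip(projected, hidden)]


def sgd_update(weights, delta, inputs, lr):
    """In-place outer-product update: W -= lr * delta * inputs^T."""
    for row, d in zip(weights, delta):
        for j, x in enumerate(inputs):
            row[j] -= lr * d * x


feedback = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # fixed, never trained
hidden = [0.0, 0.5, 1.0]                           # tanh activations
output_error = [0.5, -1.0]                         # prediction minus target
delta = dfa_hidden_delta(feedback, output_error, hidden)

w1 = [[0.0, 0.0] for _ in range(3)]                # hidden x input weights
sgd_update(w1, delta, inputs=[1.0, 2.0], lr=0.1)
print(delta, w1)
```

In a deeper network each hidden layer would get its own feedback matrix, and all layer deltas could be computed as soon as the output error is known, which is the layer-parallel property mentioned above.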
Learning in Log-Domain: Subthreshold Analog AI Accelerator Based on Stochastic Gradient Descent
Tageldeen, Momen K, Belgaid, Yacine, Mohan, Vivek, Wang, Zhou, Drakakis, Emmanuel M
In recent years, artificial intelligence (AI) has become an integral part of daily life, serving as a transformative tool across various professional domains [1] and driving personal applications through advancements in transformer models that power large language models (LLMs) [2]. However, both training and inference of AI models demand substantial computational and energy resources, which are becoming increasingly challenging to access [3, 4]. While server-class GPUs are effective for training, their energy inefficiency [5] and high costs present significant barriers [6]. Additionally, the environmental impact of energy-intensive AI systems has raised critical concerns about their role in exacerbating climate change [4]. Amdahl's law predicts that performance and efficiency gains are best achieved through innovative application-specific accelerator architectures rather than scaling up multi-core general-purpose processors [7]. Consequently, application-specific integrated circuits (ASICs), both digital and analog, have emerged as critical solutions for enabling high-efficiency training and inference of artificial neural networks [7, 8, 9]. Digital accelerators are widely adopted for training workloads. Notable examples include the Brainwave Neural Processing Unit (NPU) [10], Google's Tensor Processing Unit (TPU) [11], and low-precision inference accelerators such as YodaNN [5], the Unified Neural Processing Unit (UNPU) [12], and BRein Memory [13].
Current Opinions on Memristor-Accelerated Machine Learning Hardware
Jiang, Mingrui, Xu, Yichun, Li, Zefan, Li, Can
The unprecedented advancement of artificial intelligence has placed immense demands on computing hardware, but traditional silicon-based semiconductor technologies are approaching their physical and economic limits, prompting the exploration of novel computing paradigms. Memristors offer a promising solution, enabling in-memory analog computation and massive parallelism, which leads to low latency and power consumption. This manuscript reviews the current status of memristor-based machine learning accelerators, highlighting the milestones achieved in developing prototype chips that not only accelerate neural network inference but also tackle other machine learning tasks. More importantly, it discusses our opinion on current key challenges that remain in this field, such as device variation, the need for efficient peripheral circuitry, and systematic co-design and optimization. We also share our perspective on potential future directions, some of which address existing challenges while others explore untouched territories. By addressing these challenges through interdisciplinary efforts spanning device engineering, circuit design, and systems architecture, memristor-based accelerators could significantly advance the capabilities of AI hardware, particularly for edge applications where power efficiency is paramount.
Efficient Reprogramming of Memristive Crossbars for DNNs: Weight Sorting and Bit Stucking
We introduce a novel approach to reduce the number of times memristors must be reprogrammed on bit-sliced compute-in-memory crossbars for deep neural networks (DNNs). Our approach addresses the limited endurance of non-volatile memories, which restricts the number of times they can be reprogrammed. To reduce reprogramming demands, we employ two techniques: (1) we organize weights into sorted sections to schedule reprogramming of similar crossbars, maximizing memristor state reuse, and (2) we reprogram only a fraction of randomly selected memristors in low-order columns, leveraging their bit-level distribution and recognizing their relatively small impact on model accuracy. We evaluate our approach for state-of-the-art models on the ImageNet-1K dataset. We demonstrate a substantial reduction in crossbar reprogramming by 3.7x for ResNet-50 and 21x for ViT-Base, while maintaining model accuracy within a 1% margin.
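The first technique rests on the observation that a memristor only needs a write when its bit changes between consecutive crossbar contents. The sketch below is a heavily simplified illustration of that point, assuming a single 4-bit weight code per reprogramming step rather than whole sorted sections; the function names and the example codes are hypothetical.

```python
def bit_flips(a, b):
    """Number of bit-sliced memristors whose state differs between codes a and b."""
    return bin(a ^ b).count("1")


def reprogramming_cost(schedule):
    """Total cell writes when each code overwrites the previous one in order."""
    return sum(bit_flips(a, b) for a, b in zip(schedule, schedule[1:]))


# Hypothetical 4-bit weight codes that take turns occupying one crossbar row.
codes = [12, 3, 13, 2]
unsorted_cost = reprogramming_cost(codes)         # 4 + 3 + 4 = 11 writes
sorted_cost = reprogramming_cost(sorted(codes))   # 1 + 4 + 1 = 6 writes
print(unsorted_cost, sorted_cost)
```

Sorting does not reduce writes for every possible sequence (a step such as 7 to 8 flips all four bits), so the benefit depends on the weight distribution.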