In-Memory Computing
Efficient Deployment of Transformer Models in Analog In-Memory Computing Hardware
Li, Chen, Lammie, Corey, Gallo, Manuel Le, Rajendran, Bipin
Analog in-memory computing (AIMC) has emerged as a promising solution to overcome the von Neumann bottleneck, accelerating neural network computations and improving computational efficiency. While AIMC has demonstrated success with architectures such as CNNs, MLPs, and RNNs, deploying transformer-based models using AIMC presents unique challenges. Transformers are expected to handle diverse downstream tasks and adapt to new user data or instructions after deployment, which requires more flexible approaches to suit AIMC constraints. In this paper, we propose a novel method for deploying pre-trained transformer models onto AIMC hardware. Unlike traditional approaches requiring hardware-aware training, our technique allows direct deployment without the need for retraining the original model. Instead, we utilize lightweight, low-rank adapters -- compact modules stored in digital cores -- to adapt the model to hardware constraints. We validate our approach on MobileBERT, demonstrating accuracy on par with, or even exceeding, a traditional hardware-aware training approach. Our method is particularly appealing in multi-task scenarios, as it enables a single analog model to be reused across multiple tasks. Moreover, it supports on-chip adaptation to new hardware constraints and tasks without updating analog weights, providing a flexible and versatile solution for real-world AI applications. Code is available.
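The adapter idea in this abstract can be sketched in a few lines. The following toy model (function names, the Gaussian noise model, and the dimensions are illustrative assumptions, not the authors' implementation) shows the core structure: a frozen, noisy analog matrix-vector product plus a cheap low-rank correction computed in digital.

```python
import random

def matvec(W, x):
    # Plain matrix-vector product: each row of W dotted with x.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def analog_matvec(W, x, noise=0.05, seed=0):
    # Toy model of the analog core: the pre-trained weights are programmed
    # once and each read is perturbed by device noise (assumed Gaussian here).
    rng = random.Random(seed)
    noisy_W = [[w + rng.gauss(0.0, noise) for w in row] for row in W]
    return matvec(noisy_W, x)

def adapted_forward(W, A, B, x):
    # Analog path (frozen, noisy) plus a digital low-rank correction A @ (B @ x).
    # A is d_out x r and B is r x d_in with small rank r, so the adapter is
    # compact enough to store and update in the digital cores.
    y_analog = analog_matvec(W, x)
    y_adapter = matvec(A, matvec(B, x))
    return [ya + yc for ya, yc in zip(y_analog, y_adapter)]
```

Because only A and B are trained, the analog weights never need reprogramming: switching tasks or compensating new hardware non-idealities means swapping in a different pair of small digital matrices.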
Approximate ADCs for In-Memory Computing
Ghosh, Arkapravo, Sadana, Hemkar Reddy, Debnath, Mukut, Maji, Panthadip, Negi, Shubham, Gupta, Sumeet, Sharad, Mrigank, Roy, Kaushik
In-memory computing (IMC) architectures for deep learning (DL) accelerators leverage energy-efficient and highly parallel matrix vector multiplication (MVM) operations, implemented directly in memory arrays. Such IMC designs have been explored based on CMOS as well as emerging non-volatile memory (NVM) technologies like RRAM. IMC architectures generally involve a large number of cores consisting of memory arrays, storing the trained weights of the DL model. Peripheral units like DACs and ADCs are also used for applying inputs and reading out the output values. Recently reported designs reveal that the ADCs required for reading out the MVM results consume more than 85% of the total compute power and also dominate the area, thereby eroding the benefits of the IMC scheme. Mitigation of imperfections in the ADCs, namely non-linearity and variations, incurs significant design overheads due to dedicated calibration units. In this work we present peripheral-aware design of IMC cores to mitigate such overheads. It involves incorporating the non-idealities of the ADCs in the training of the DL models, along with those of the memory units. The proposed approach applies equally well to both current-mode and charge-mode MVM operations demonstrated in recent years, and can significantly simplify the design of mixed-signal IMC units.
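The training trick described here amounts to inserting an ADC transfer function into the forward pass. A minimal sketch (the gain error, bit width, and full-scale range below are invented parameters, not the paper's ADC model):

```python
def adc_model(value, n_bits=6, v_max=1.0, gain_error=0.02):
    # Toy ADC transfer function: a static gain error, clipping at the
    # full-scale range, then uniform quantization to 2**n_bits - 1 levels.
    levels = 2 ** n_bits - 1
    v = value * (1.0 + gain_error)            # static non-ideality
    v = max(-v_max, min(v_max, v))            # clipping at full scale
    code = round((v + v_max) / (2 * v_max) * levels)
    return code / levels * 2 * v_max - v_max  # analog value the code encodes

def hardware_aware_mvm(W, x):
    # During training, each MVM result is passed through the ADC model so
    # the network learns weights that tolerate the converter's imperfections.
    return [adc_model(sum(w * xi for w, xi in zip(row, x))) for row in W]
```

Training through this function lets approximate, uncalibrated ADCs be used at inference time, which is how the design overhead of dedicated calibration units is avoided.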
Topology Optimization of Random Memristors for Input-Aware Dynamic SNN
Wang, Bo, Wang, Shaocong, Lin, Ning, Li, Yi, Yu, Yifei, Zhang, Yue, Yang, Jichang, Wu, Xiaoshan, He, Yangu, Wang, Songqi, Chen, Rui, Li, Guoqi, Qi, Xiaojuan, Wang, Zhongrui, Shang, Dashan
There is unprecedented development in machine learning, exemplified by recent large language models and world simulators, which are artificial neural networks running on digital computers. However, they still cannot parallel human brains in terms of energy efficiency and the streamlined adaptability to inputs of different difficulties, due to differences in signal representation, optimization, run-time reconfigurability, and hardware architecture. To address these fundamental challenges, we introduce pruning optimization for input-aware dynamic memristive spiking neural network (PRIME). Signal representation-wise, PRIME employs leaky integrate-and-fire neurons to emulate the brain's inherent spiking mechanism. Drawing inspiration from the brain's structural plasticity, PRIME optimizes the topology of a random memristive spiking neural network without expensive memristor conductance fine-tuning. For runtime reconfigurability, inspired by the brain's dynamic adjustment of computational depth, PRIME employs an input-aware dynamic early-stop policy to minimize latency during inference, thereby boosting energy efficiency without compromising performance. Architecture-wise, PRIME leverages memristive in-memory computing, mirroring the brain and mitigating the von Neumann bottleneck. We validated our system using a 40 nm 256 Kb memristor-based in-memory computing macro on neuromorphic image classification and image inpainting. Our results demonstrate that the classification accuracy and Inception Score are comparable to the software baseline, while achieving up to 62.50-fold improvement in energy efficiency and up to 77.0% savings in computational load. The system also exhibits robustness against stochastic synaptic noise of analogue memristors. Our software-hardware co-designed model paves the way to future brain-inspired neuromorphic computing with brain-like energy efficiency and adaptivity.
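The input-aware early-stop policy can be illustrated with a small sketch: accumulate per-timestep evidence (e.g. output spike counts) and halt as soon as the decision is confident. The margin criterion and parameter values below are illustrative assumptions, not PRIME's actual policy.

```python
def dynamic_early_stop(step_logits, margin=2.0, max_steps=10):
    # Accumulate evidence over SNN timesteps and stop as soon as the
    # leading class beats the runner-up by `margin`. Easy inputs then
    # finish in fewer timesteps than hard ones, saving energy and latency.
    acc = [0.0] * len(step_logits[0])
    for t, logits in enumerate(step_logits[:max_steps], start=1):
        acc = [a + l for a, l in zip(acc, logits)]
        ranked = sorted(acc, reverse=True)
        if ranked[0] - ranked[1] >= margin:
            return acc.index(max(acc)), t  # prediction and steps used
    return acc.index(max(acc)), min(len(step_logits), max_steps)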
A Precision-Optimized Fixed-Point Near-Memory Digital Processing Unit for Analog In-Memory Computing
Ferro, Elena, Vasilopoulos, Athanasios, Lammie, Corey, Gallo, Manuel Le, Benini, Luca, Boybat, Irem, Sebastian, Abu
Analog In-Memory Computing (AIMC) is an emerging technology for fast and energy-efficient Deep Learning (DL) inference. However, a certain amount of digital post-processing is required to deal with circuit mismatches and non-idealities associated with the memory devices. Efficient near-memory digital logic is critical to retain the high area/energy efficiency and low latency of AIMC. Existing systems adopt Floating Point 16 (FP16) arithmetic with limited parallelization capability and high latency. To overcome these limitations, we propose a Near-Memory digital Processing Unit (NMPU) based on fixed-point arithmetic. It achieves competitive accuracy and higher computing throughput than previous approaches while minimizing the area overhead. Moreover, the NMPU supports standard DL activation steps, such as ReLU and Batch Normalization. We perform a physical implementation of the NMPU design in a 14 nm CMOS technology and provide detailed performance, power, and area assessments. We validate the efficacy of the NMPU by using data from an AIMC chip and demonstrate that a simulated AIMC system with the proposed NMPU outperforms existing FP16-based implementations, providing 139× speed-up, 7.8× smaller area, and a competitive power consumption. Additionally, our approach achieves an inference accuracy of 86.65%/65.06%, with an accuracy drop of just 0.12%/0.4% compared to the FP16 baseline when benchmarked with ResNet9/ResNet32 networks trained on the CIFAR10/CIFAR100 datasets, respectively.
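The core of fixed-point near-memory post-processing is rescaling a wide integer accumulator with an integer multiplier and a right shift instead of FP16 arithmetic. A minimal sketch (the multiplier/shift encoding below is a common fixed-point idiom, assumed here rather than taken from the NMPU design):

```python
def fixedpoint_requant(acc, multiplier, shift, relu=True):
    # Pure-integer post-processing of an analog tile's accumulator:
    # the effective scale is multiplier / 2**shift, applied with
    # round-to-nearest via the added half-LSB, then an optional ReLU.
    y = (acc * multiplier + (1 << (shift - 1))) >> shift
    if relu:
        y = max(0, y)
    return y
```

For example, `multiplier=205, shift=8` approximates a scale of 0.8 (205/256), so an accumulator of 100 requantizes to 80. A batch-norm step folds into the same multiplier/shift plus an integer bias, which is why this unit can replace FP16 logic at much lower area and latency.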
Pruning random resistive memory for optimizing analogue AI
Li, Yi, Wang, Songqi, Zhao, Yaping, Wang, Shaocong, Zhang, Woyu, He, Yangu, Lin, Ning, Cui, Binbin, Chen, Xi, Zhang, Shiming, Jiang, Hao, Lin, Peng, Zhang, Xumeng, Qi, Xiaojuan, Wang, Zhongrui, Xu, Xiaoxin, Shang, Dashan, Liu, Qi, Cheng, Kwang-Ting, Liu, Ming
The rapid advancement of artificial intelligence (AI) has been marked by large language models exhibiting human-like intelligence. However, these models also present unprecedented challenges to energy consumption and environmental sustainability. One promising solution is to revisit analogue computing, a technique that predates digital computing and exploits emerging analogue electronic devices, such as resistive memory, which features in-memory computing, high scalability, and nonvolatility. However, analogue computing still faces the same challenges as before: programming nonidealities and expensive programming due to the underlying device physics. Here, we report a universal solution, software-hardware co-design using structural plasticity-inspired edge pruning to optimize the topology of a randomly weighted analogue resistive memory neural network. Software-wise, the topology of a randomly weighted neural network is optimized by pruning connections rather than precisely tuning resistive memory weights. Hardware-wise, we reveal the physical origin of the programming stochasticity using transmission electron microscopy, which is leveraged for large-scale and low-cost implementation of an overparameterized random neural network containing high-performance sub-networks. We implemented the co-design on a 40 nm 256K resistive memory macro, observing 17.3% and 19.9% accuracy improvements in image and audio classification on the FashionMNIST and Spoken Digits datasets, as well as 9.8% (2%) improvement in PR (ROC) in image segmentation on the DRIVE dataset, respectively. This is accompanied by 82.1%, 51.2%, and 99.8% improvements in energy efficiency thanks to analogue in-memory computing. By embracing the intrinsic stochasticity and in-memory computing, this work may solve the biggest obstacle of analogue computing systems and thus unleash their immense potential for next-generation AI hardware.
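The prune-instead-of-program idea can be sketched compactly: the resistive weights stay at whatever random conductances they landed on, and optimization touches only a binary edge mask. In this toy version (all names are illustrative; random scores stand in for the learned edge saliency the paper would train) a fixed fraction of edges survives.

```python
import random

def prune_random_layer(rng, d_in, d_out, keep_ratio=0.5):
    # The resistive weights are random and never reprogrammed; structural
    # plasticity acts only on a binary edge mask. Edges are ranked by a
    # score (a stand-in for a learned saliency) and the top fraction kept.
    W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
    scores = [[rng.random() for _ in range(d_in)] for _ in range(d_out)]
    flat = sorted(s for row in scores for s in row)
    cutoff = flat[int(len(flat) * (1.0 - keep_ratio))]
    mask = [[1 if s >= cutoff else 0 for s in row] for row in scores]
    return W, mask

def masked_matvec(W, mask, x):
    # Inference uses only the surviving edges of the random crossbar.
    return [sum(w * m * xi for w, m, xi in zip(wr, mr, x))
            for wr, mr in zip(W, mask)]
```

Because pruning only disconnects cells rather than tuning conductances, the expensive and stochastic write operation is avoided entirely, which is the co-design's main hardware payoff.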
Enabling In-Memory Computing for Artificial Intelligence Part 1: The Analog Approach - Intel Communities
Hechen Wang is a research scientist for Intel Labs with interests in mixed-signal circuits, data converters, digital frequency synthesizers, wireless communication systems, and analog/mixed-signal compute-in-memory for AI applications. The fundamental building block of computer memory is the memory cell: an electronic circuit that stores binary information. In the conventional approach to data processing, the data resides on a hard disk in the system or attached over a network. When needed, it's called into the local system memory, or RAM, and then moved to the CPU. This lengthy process is inefficient, so researchers began to seek an alternative.
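The analog alternative this article introduces performs the computation where the data lives. The behavior of a resistive crossbar can be sketched as follows (a functional model only, with illustrative names, not a circuit description):

```python
def crossbar_mvm(G, v):
    # In-memory matrix-vector multiply on a resistive crossbar: each cell
    # passes current I = G * V (Ohm's law), and currents on a shared bit
    # line sum (Kirchhoff's current law), so reading the column currents
    # computes G^T v without ever moving the weights to a CPU.
    n_cols = len(G[0])
    return [sum(G[row][col] * v[row] for row in range(len(G)))
            for col in range(n_cols)]
```

The weights (conductances) never travel over the disk-to-RAM-to-CPU path described above; only the input voltages and output currents cross the array boundary.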
Interconnect Parasitics and Partitioning in Fully-Analog In-Memory Computing Architectures
Amin, Md Hasibul, Elbtity, Mohammed, Zand, Ramtin
Fully-analog in-memory computing (IMC) architectures that implement both matrix-vector multiplication and non-linear vector operations within the same memory array have shown promising performance benefits over conventional IMC systems due to the removal of energy-hungry signal conversion units. However, maintaining the computation in the analog domain for the entire deep neural network (DNN) comes with potential sensitivity to interconnect parasitics. Thus, in this paper, we investigate the effect of wire parasitic resistance and capacitance on the accuracy of DNN models deployed on fully-analog IMC architectures. Moreover, we propose a partitioning mechanism to alleviate the impact of the parasitics while keeping the computation in the analog domain, by dividing large arrays into multiple partitions. The SPICE circuit simulation results for a 400×120×84×10 DNN model deployed on a fully-analog IMC circuit show that a 94.84% accuracy could be achieved for the MNIST classification task with 16, 8, and 8 horizontal partitions, as well as 8, 8, and 1 vertical partitions for the first, second, and third layers of the DNN, respectively, which is comparable to the ~97% accuracy realized by a digital implementation on a CPU. It is shown that these accuracy benefits are achieved at the cost of higher power consumption, due to the extra circuitry required for handling partitioning.
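Functionally, partitioning splits one large array into tiles with shorter (less parasitic) wires and recombines their partial results. The sketch below shows the arithmetic of partitioning along the input dimension only (a simplification of the paper's horizontal/vertical scheme, with illustrative names):

```python
def partitioned_matvec(W, x, parts):
    # Split the input dimension into `parts` tiles. Each tile is a smaller
    # array whose shorter wires suffer less IR drop; the tiles' partial
    # output currents are summed to recover the full matrix-vector product.
    d_in = len(x)
    step = d_in // parts
    out = [0.0] * len(W)
    for p in range(parts):
        lo = p * step
        hi = (p + 1) * step if p < parts - 1 else d_in
        for i, row in enumerate(W):
            out[i] += sum(row[j] * x[j] for j in range(lo, hi))
    return out
```

Mathematically the result is identical to the unpartitioned product; on real hardware the gain is that each tile's parasitic error is smaller, at the cost of the extra summing circuitry the abstract notes.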
Samsung is working on artificial intelligence chips that use in-memory computing.
Samsung Electronics has announced the development of an in-memory computing system that combines memory and system semiconductors. For the first time, non-volatile memories, dubbed "magnetoresistive random access memory," are being used to enable the new technology, according to the world's largest memory chipmaker. In a traditional computer architecture, data is stored in memory chips and computed by separate processor chips.
Rounding Up Machine Learning Developments From 2020
The year 2020 saw many exciting developments in machine learning. As the year comes to an end, here is a roundup of these innovations across machine learning domains such as reinforcement learning, Natural Language Processing, ML frameworks such as PyTorch and TensorFlow, and more. Arm-based Graviton processors went mainstream in 2020; they utilize 30 billion transistors with 64-bit Arm cores and were built by Annapurna Labs, the Israeli engineering company acquired by AWS, to power memory-intensive workloads like real-time big data analytics. They showed a 40% performance improvement, emerging as an alternative to x86-based processors for machine learning and shifting the trend from the Intel-dominated cloud market toward Arm-based Graviton processors.
BANKING: MAKING AI IN CUSTOMER SERVICE A REALITY
Banks are constantly looking for opportunities to up- or cross-sell products to customers. Increasing product penetration from 2.5 products to 4 products per customer can add millions to the bottom line, and it is estimated to be 5 to 10 times cheaper to up- or cross-sell to an existing customer than to acquire a new one. Combining in-memory computing with AI opens up new opportunities to do so. When it comes to engaging customers in up- or cross-selling conversations, timing is everything. Customers are far more likely to be receptive to an approach when they are already interacting with the bank -- online, via the telephone, or in branch.