Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models
Nathan Leroux, Paul-Philipp Manea, Chirag Sudarshan, Jan Finkbeiner, Sebastian Siegel, John Paul Strachan, Emre Neftci
arXiv.org Artificial Intelligence
Transformer networks, driven by self-attention, are central to Large Language Models. In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks. We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable the parallel analog dot-product computation required for self-attention. However, the analog gain cell circuits introduce non-idealities and constraints that prevent the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm achieving text processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and five orders of magnitude, respectively, compared to GPUs, marking a significant step toward ultra-fast, low-power generative Transformers.
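For reference, the KV-cached attention step that the gain-cell arrays are intended to accelerate can be sketched as follows. This is a generic NumPy illustration of the cache update and the two dot-product stages (query-key similarities and the weighted sum of values), not the authors' hardware-adapted formulation; the function name `attention_step` and all shapes are illustrative assumptions.

```python
# Minimal sketch of one generation step of KV-cached self-attention (single head).
# The two matrix-vector products (K @ q and weights @ V) are the operations that
# an in-memory computing array would evaluate in parallel; here they are plain
# digital NumPy calls with no analog non-idealities modeled.
import numpy as np

def attention_step(q, k_new, v_new, k_cache, v_cache):
    """q, k_new, v_new: (d_head,) projections of the current token.
    k_cache, v_cache: (t, d_head) projections of the t previous tokens."""
    # Append the new token's projections to the cache
    # (in an in-memory architecture this corresponds to writing the memory array).
    K = np.vstack([k_cache, k_new])           # (t+1, d_head)
    V = np.vstack([v_cache, v_new])           # (t+1, d_head)

    # First dot-product stage: similarity of the query with all cached keys.
    scores = K @ q / np.sqrt(q.shape[-1])     # (t+1,)

    # Softmax over the sequence dimension.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Second dot-product stage: weighted sum of the cached values.
    out = weights @ V                         # (d_head,)
    return out, K, V

# Example: 8 previously cached tokens, head dimension 16.
rng = np.random.default_rng(0)
d_head, t = 16, 8
out, K, V = attention_step(rng.normal(size=d_head),
                           rng.normal(size=d_head),
                           rng.normal(size=d_head),
                           rng.normal(size=(t, d_head)),
                           rng.normal(size=(t, d_head)))
print(out.shape, K.shape)  # (16,) (9, 16)
```

In a GPU, the growing `K` and `V` caches must be streamed from memory at every step; the paper's premise is that keeping them resident in gain-cell arrays and computing both dot-product stages in the analog domain removes this data movement.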
Nov-25-2024