SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors

Rakka, Mariam; Li, Jinhao; Dai, Guohao; Eltawil, Ahmed; Fouda, Mohammed E.; Kurdahi, Fadi

arXiv.org Artificial Intelligence 

Abstract--Recent research efforts focus on reducing the computational and memory overheads of Large Language Models (LLMs) to make them feasible on resource-constrained devices. Despite advancements in compression techniques, non-linear operators like Softmax and Layernorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. Our method achieves up to three orders of magnitude improvement in the energy-delay product compared to A100 and RTX3090 GPUs, making LLMs more deployable without compromising performance. Softmax contributes up to 38% of the run time for longer sequence lengths.