T-REX: A 68-567 μs/token, 0.41-3.95 μJ/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET
Moon, Seunghyun, Li, Mao, Chen, Gregory, Knag, Phil, Krishnamurthy, Ram, Seok, Mingoo
arXiv.org Artificial Intelligence
This work introduces novel training and post-training compression schemes to reduce external memory access during transformer model inference. Additionally, a new control flow mechanism, called dynamic batching, and a novel buffer architecture, termed a two-direction accessible register file, further reduce external memory access while improving hardware utilization.
Feb-28-2025