AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies
arXiv.org Artificial Intelligence
This paper presents the first comprehensive cross-architectural performance analysis of contemporary AI accelerators designed for LLM inference, introducing a novel workload-centric evaluation methodology that quantifies architectural fitness across operational regimes. We provide the first systematic comparison of memory hierarchies, compute architectures, and interconnect strategies across the full spectrum of commercial accelerators, from GPU-based designs to specialized wafer-scale engines. Our analysis reveals that no single architecture dominates across all workload categories, with performance variations of up to 3.7× between architectures depending on batch size and sequence length. We quantitatively evaluate four primary scaling strategies for trillion-parameter models, demonstrating that expert parallelism delivers the best parameter-to-compute ratio (8.4×) but introduces 2.1× higher latency variance compared to tensor parallelism. This work provides system designers with actionable insights for accelerator selection based on workload characteristics, while identifying key architectural gaps in current designs that will shape future hardware development.
Jun-10-2025