AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies
arXiv.org Artificial Intelligence
This paper presents the first comprehensive cross-architectural performance analysis of contemporary AI accelerators designed for LLM inference, introducing a novel workload-centric evaluation methodology that quantifies architectural fitness across operational regimes. We provide the first systematic comparison of memory hierarchies, compute architectures, and interconnect strategies across the full spectrum of commercial accelerators, from GPU-based designs to specialized wafer-scale engines. Our analysis reveals that no single architecture dominates across all workload categories, with performance variations of up to 3.7× between architectures depending on batch size and sequence length. We quantitatively evaluate four primary scaling strategies for trillion-parameter models, demonstrating that expert parallelism delivers the best parameter-to-compute ratio (8.4×) but introduces 2.1× higher latency variance compared to tensor parallelism. This work provides system designers with actionable insights for accelerator selection based on workload characteristics, while identifying key architectural gaps in current designs that will shape future hardware development.
Jun-10-2025