Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation
Zarkadas, Ioannis, Tomlinson, Amanda, Cidon, Asaf, Kasikci, Baris, Weisse, Ofir
arXiv.org Artificial Intelligence
As models become larger, ML accelerators are a scarce resource whose performance must be continually optimized to improve efficiency. Existing performance analysis tools are coarse-grained and fail to capture model performance at the machine-code level. In addition, these tools often do not provide specific recommendations for optimizations. We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level that provides actionable optimization recommendations.

These portable mid-level representations are then compiled into the byte-code which runs on the ML accelerator. The development of each of these levels of abstraction requires a huge engineering effort, and inefficiencies introduced at any level can cause performance degradation for the model. The companies that offer generative AI services are often doing so at a massive scale (for example, the infrastructure to provide inference for Microsoft's Bing AI chatbot is estimated to cost $4 billion [57]), meaning that even a small degradation in performance can lead to large capital losses.
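The compilation flow described above, where a portable mid-level representation is lowered and then compiled to accelerator code, can be observed directly in a framework such as JAX. The following is a minimal sketch, not taken from the paper; the function `f` and its input shape are illustrative:

```python
import jax
import jax.numpy as jnp

# Illustrative function; any jit-compatible function works here.
def f(x):
    return jnp.tanh(x) @ x.T

# Stage 1: trace and lower to a portable mid-level IR (StableHLO).
lowered = jax.jit(f).lower(jnp.ones((4, 4)))
print(lowered.as_text()[:200])   # human-readable StableHLO module text

# Stage 2: the backend compiler turns the IR into device-specific code.
compiled = lowered.compile()
print(compiled.as_text()[:200])  # backend (e.g. XLA:CPU/GPU) compiled output
```

Inefficiencies can creep in at either stage: the traced program may lower to a suboptimal IR, or the backend compiler may emit poor machine code, which is the level this paper targets.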
Mar-18-2025