Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation
Zarkadas, Ioannis, Tomlinson, Amanda, Cidon, Asaf, Kasikci, Baris, Weisse, Ofir
arXiv.org Artificial Intelligence
As models become larger, ML accelerators are a scarce resource whose performance must be continually optimized to improve efficiency. Existing performance analysis tools are coarse-grained and fail to capture model performance at the machine-code level. In addition, these tools often do not provide specific recommendations for optimizations. We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level that provides actionable optimization recommendations.

These portable mid-level representations are then compiled into the byte-code which runs on the ML accelerator. The development of each of these levels of abstraction requires a huge engineering effort, and inefficiencies introduced at any level can cause performance degradation for the model. The companies that offer generative AI services are often doing so at a massive scale (for example, the infrastructure to provide inference for Microsoft's Bing AI chatbot is estimated to cost $4 billion [57]), meaning that even a small degradation in performance can lead to large capital losses.
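The compilation flow described above, where a portable mid-level representation is lowered and then compiled to accelerator code, can be observed directly in a framework such as JAX. The following is a minimal sketch, not taken from the paper; the function `f` and its input shape are illustrative:

```python
import jax
import jax.numpy as jnp

# Illustrative function; any jit-compatible function works here.
def f(x):
    return jnp.tanh(x) @ x.T

# Stage 1: trace and lower to a portable mid-level IR (StableHLO).
lowered = jax.jit(f).lower(jnp.ones((4, 4)))
print(lowered.as_text()[:200])   # human-readable StableHLO module text

# Stage 2: the backend compiler turns the IR into device-specific code.
compiled = lowered.compile()
print(compiled.as_text()[:200])  # backend (e.g. XLA:CPU/GPU) compiled output
```

Inefficiencies can creep in at either stage: the traced program may lower to a suboptimal IR, or the backend compiler may emit poor machine code, which is the level this paper targets.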
Mar-18-2025