An Investigation of FP8 Across Accelerators for LLM Inference
Jiwoo Kim, Joonhyung Lee, Gunho Park, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee, Youngjoo Lee
– arXiv.org Artificial Intelligence
The introduction of 8-bit floating-point (FP8) computation units in modern AI accelerators has generated significant interest in FP8-based large language model (LLM) inference. Unlike 16-bit floating-point formats, FP8 in deep learning requires a shared scaling factor. Additionally, while E4M3 and E5M2 are well-defined at the individual value level, their scaling and accumulation methods remain unspecified and vary across hardware and software implementations. As a result, FP8 behaves more like a quantization format than a standard numeric representation. In this work, we provide the first comprehensive analysis of FP8 computation and acceleration on two AI accelerators: the NVIDIA H100 and Intel Gaudi 2. Our findings highlight that the Gaudi 2, by leveraging FP8, achieves higher throughput-to-power efficiency than the H100 during LLM inference, offering valuable insights into the practical implications of FP8 adoption for datacenter-scale LLM serving.
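As a concrete illustration of why FP8 with a shared scaling factor behaves like a quantization format, the sketch below emulates per-tensor E4M3 quantization and dequantization in NumPy. The function names, the rounding helper, and the choice of a single per-tensor scale are illustrative assumptions for this note, not the authors' implementation or any vendor's FP8 API.

```python
import numpy as np

E4M3_MAX = 448.0           # largest finite magnitude in OCP FP8 E4M3
E4M3_MIN_NORMAL = 2.0**-6  # smallest normal magnitude
MANTISSA_BITS = 3

def fp8_e4m3_round(x: np.ndarray) -> np.ndarray:
    """Round to the nearest representable E4M3 value, saturating at +-448."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mag = np.abs(x)
    # Exponent of each value; anything below the smallest normal is treated
    # as subnormal and rounded on the fixed 2**-9 grid.
    exp = np.floor(np.log2(np.maximum(mag, E4M3_MIN_NORMAL)))
    quantum = 2.0 ** (exp - MANTISSA_BITS)   # spacing between neighboring values
    return np.round(x / quantum) * quantum

def fp8_quant_dequant(tensor: np.ndarray) -> tuple[np.ndarray, float]:
    """Apply one shared (per-tensor) scale, quantize to E4M3, then dequantize."""
    amax = float(np.max(np.abs(tensor)))
    scale = max(amax / E4M3_MAX, np.finfo(np.float32).tiny)  # shared scaling factor
    q = fp8_e4m3_round(tensor / scale)       # values now representable in E4M3
    return q * scale, scale

x = np.random.randn(4, 4).astype(np.float32)
x_fp8, scale = fp8_quant_dequant(x)
print("max abs error:", np.max(np.abs(x - x_fp8)), "scale:", scale)
```

Because the scale is shared across the whole tensor, a single large outlier widens the rounding quantum for every element, which is the quantization-like behavior the abstract alludes to; how that scale is chosen and how accumulation is performed is exactly what differs across hardware and software stacks.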
Feb-5-2025
- Country:
- Asia (0.46)
- North America > United States (0.14)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Information Technology (0.50)
- Technology: