PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers

Gwangoo Yeo, Jiin Kim, Yujeong Choi, Minsoo Rhu

Nov-28-2024–arXiv.org Artificial Intelligence

NVIDIA's Multi-Instance GPU (MIG) is a feature that enables system designers to reconfigure one large GPU into multiple smaller GPU slices. This work characterizes this emerging GPU and evaluates its effectiveness in designing high-performance AI inference servers. Our study reveals that the data preprocessing stage of AI inference is a significant performance bottleneck for MIG. To this end, we present PREBA, a hardware/software co-design targeting MIG inference servers. Our first proposition is an FPGA-based data preprocessing accelerator that unlocks the full potential of MIG through domain-specific acceleration of data preprocessing. The MIG inference server, freed from preprocessing overheads, is then augmented with our dynamic batching system that enables high-performance inference. PREBA is implemented end-to-end in real systems, providing a 3.7x improvement in throughput, 3.4x reduction in tail latency, 3.5x improvement in energy-efficiency, and 3.0x improvement in cost-efficiency.
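The abstract does not describe how PREBA's dynamic batching system works, so the following is only a minimal sketch of generic time-window dynamic batching, not the authors' algorithm. The function name `form_batches` and the parameters `max_batch_size` and `max_wait` are hypothetical, and the sketch assumes the batch is dispatched as soon as it is full or the wait budget of its first request expires, ignoring how long the GPU slice is busy.

```python
from typing import List, Tuple


def form_batches(
    arrivals: List[float], max_batch_size: int, max_wait: float
) -> List[Tuple[float, List[int]]]:
    """Group requests into batches (generic sketch, not PREBA's policy).

    arrivals: request arrival times, sorted in ascending order.
    Returns a list of (dispatch_time, request_indices) pairs.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")

    batches: List[Tuple[float, List[int]]] = []
    current: List[int] = []
    deadline = 0.0  # latest dispatch time allowed for the open batch

    for i, t in enumerate(arrivals):
        # The open batch timed out before this request arrived.
        if current and t > deadline:
            batches.append((deadline, current))
            current = []
        # The first request of a new batch starts the wait budget.
        if not current:
            deadline = t + max_wait
        current.append(i)
        # A full batch is dispatched immediately.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []

    # Flush whatever is left once its wait budget expires.
    if current:
        batches.append((deadline, current))
    return batches


if __name__ == "__main__":
    for dispatch_time, members in form_batches([0.0, 1.0, 7.0], 4, 5.0):
        print(dispatch_time, members)
```

In this sketch, a larger `max_wait` tends to produce fuller batches (higher throughput) at the cost of added queueing delay for the earliest request in each batch, which is the throughput versus tail-latency trade-off that any batching policy for an inference server has to balance.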

inference server, proceedings, throughput
