Iris: First-Class Multi-GPU Programming Experience in Triton
Muhammad Awad, Muhammad Osama, Brandon Potter
arXiv.org Artificial Intelligence
Multi-GPU programming traditionally requires developers to navigate complex trade-offs between performance and programmability. High-performance implementations typically rely on low-level HIP/CUDA communication libraries that demand substantial engineering effort for even basic overlap patterns, while simpler abstractions often sacrifice performance. We present Iris, a multi-GPU communication library implemented entirely in Python and Triton that eliminates this trade-off. Iris provides tile-based symmetric memory abstractions that naturally align with Triton's programming model, enabling developers to write single-source kernels that seamlessly interleave computation and communication. We demonstrate a taxonomy of compute-communication overlap patterns, from bulk-synchronous to fine-grained workgroup specialization, that can be implemented with minimal code changes in Iris, often requiring just a few additional lines within the same Triton kernel. Our evaluation shows that Iris achieves near-optimal bandwidth utilization in microbenchmarks and delivers up to 1.79x speedup over PyTorch and RCCL for GEMM+All-Scatter workloads, demonstrating that high-level implementations can match or exceed heavily optimized libraries while dramatically simplifying multi-GPU programming.
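The GEMM+All-Scatter workload mentioned above can be illustrated with a small CPU simulation. The sketch below is hypothetical and does not use Iris's actual API: it models symmetric memory as one identically shaped buffer per rank, and each rank computes its local GEMM tile and then writes it directly into every peer's buffer, the same compute-then-communicate step that Iris fuses into a single Triton kernel on GPU.

```python
import numpy as np

# Hypothetical CPU sketch of the GEMM+All-Scatter overlap pattern;
# Iris implements this with Triton kernels over GPU symmetric memory.
RANKS = 4
M, K, N = 8, 16, 8  # per-rank GEMM shard sizes

rng = np.random.default_rng(0)
A_shards = [rng.standard_normal((M, K)) for _ in range(RANKS)]
B = rng.standard_normal((K, N))

# "Symmetric memory": every rank allocates an identically shaped
# output buffer, so peers can write into it at a known offset.
sym_out = [np.zeros((RANKS * M, N)) for _ in range(RANKS)]

for rank in range(RANKS):
    # Compute: this rank's local GEMM tile.
    tile = A_shards[rank] @ B
    # Communicate: scatter the tile into every peer's buffer
    # (in Iris, this store happens inside the same kernel as the GEMM).
    for peer in range(RANKS):
        sym_out[peer][rank * M:(rank + 1) * M, :] = tile

# After the scatter, every rank holds the full, identical result.
full = np.vstack(A_shards) @ B
assert all(np.allclose(buf, full) for buf in sym_out)
```

In a real multi-GPU setting the per-tile stores to remote ranks overlap with the remaining GEMM tiles still being computed, which is the source of the speedup over a separate compute-then-collective sequence.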
Nov-18-2025