Scalable Runtime Architecture for Data-driven, Hybrid HPC and ML Workflow Applications
Merzky, Andre; Titov, Mikhail; Turilli, Matteo; Kilic, Ozgur; Wang, Tianle; Jha, Shantenu
arXiv.org Artificial Intelligence
Hybrid workflows combining traditional HPC and novel ML methodologies are transforming scientific computing. This paper presents the architecture and implementation of a scalable runtime system that extends RADICAL-Pilot with service-based execution to support AI-out-HPC workflows. Our runtime system enables distributed ML capabilities, efficient resource management, and seamless HPC/ML coupling across local and remote platforms. Preliminary experimental results show that our approach manages concurrent execution of ML models across local and remote HPC/cloud resources with minimal architectural overheads. This lays the foundation for prototyping three representative data-driven workflow applications and executing them at scale on leadership-class HPC platforms.
Mar-17-2025
- Country:
- North America > United States > New Jersey > Middlesex County > Piscataway (0.14)
- Genre:
- Research Report > New Finding (0.66)
- Industry:
- Health & Medicine (1.00)
- Technology: