Balin, Riccardo
Scalable and Consistent Graph Neural Networks for Distributed Mesh-based Data-driven Modeling
Barwey, Shivam, Balin, Riccardo, Lusch, Bethany, Patel, Saumil, Balakrishnan, Ramesh, Pal, Pinaki, Maulik, Romit, Vishwanath, Venkatram
This work develops a distributed graph neural network (GNN) methodology for mesh-based modeling applications using a consistent neural message passing layer. The focus is on enabling scalable operations that preserve physical consistency via halo nodes at sub-graph boundaries. Here, consistency means that a GNN trained and evaluated on one rank (one large graph) is arithmetically equivalent to the same GNN evaluated on multiple ranks (a partitioned graph). This concept is demonstrated by interfacing GNNs with NekRS, a GPU-capable exascale CFD solver developed at Argonne National Laboratory. We show how the NekRS mesh partitioning can be linked to the distributed GNN training and inference routines, resulting in a scalable mesh-based data-driven modeling workflow. We then study the impact of consistency on the scalability of mesh-based GNNs, demonstrating efficient scaling of consistent GNNs up to O(1B) graph nodes on the Frontier exascale supercomputer.
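
To make the consistency idea concrete, below is a minimal sketch of a halo-exchange aggregation step in NumPy and mpi4py. The partitioning, index plans, and function names are illustrative assumptions, not the paper's NekRS-coupled implementation: each rank first refreshes its halo copies of remote nodes, then performs a purely local scatter-add, so every owned node aggregates exactly the neighbor features it would see in a single-rank evaluation.

# Minimal sketch of a consistency-preserving aggregation step, assuming a
# halo-node partitioning like the one described above. All names (send/recv
# plans, local indexing) are hypothetical, not the paper's implementation.
# Run with: mpiexec -n 2 python halo_mp.py
import numpy as np
from mpi4py import MPI

def halo_exchange(x, send_plan, recv_plan, comm):
    """Refresh halo rows of x with owned-node features from neighbor ranks."""
    reqs = [comm.isend(x[idx].copy(), dest=r, tag=7) for r, idx in send_plan.items()]
    for r, idx in recv_plan.items():
        x[idx] = comm.recv(source=r, tag=7)
    for req in reqs:
        req.wait()

def aggregate(x, edge_index, num_owned):
    """Sum neighbor features into each owned node (one message passing hop)."""
    src, dst = edge_index
    out = np.zeros((num_owned, x.shape[1]), dtype=x.dtype)
    np.add.at(out, dst, x[src])  # scatter-add; halo rows contribute via src
    return out

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    # Global path graph 0-1-2-3 split across two ranks; local row layout is
    # [owned, owned, halo]. Rank 0 owns {0,1} with a halo copy of node 2;
    # rank 1 owns {2,3} with a halo copy of node 1.
    if rank == 0:
        x = np.array([[1.0], [2.0], [0.0]])       # halo row 2 starts stale
        edges = np.array([[1, 0, 2], [0, 1, 1]])  # (src, dst), dst is owned
        send, recv = {1: np.array([1])}, {1: np.array([2])}
    else:
        x = np.array([[3.0], [4.0], [0.0]])
        edges = np.array([[2, 1, 0], [0, 0, 1]])
        send, recv = {0: np.array([0])}, {0: np.array([2])}
    halo_exchange(x, send, recv, comm)
    print(rank, aggregate(x, edges, num_owned=2))
    # Matches the one-rank result: node 0 -> 2, node 1 -> 1+3, node 2 -> 2+4, node 3 -> 3.

Because the halo exchange happens before the local scatter-add, each owned node sees the same multiset of neighbor features as in the unpartitioned graph, which is what makes the two evaluations arithmetically equivalent.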
In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD
Balin, Riccardo, Simini, Filippo, Simpson, Cooper, Shao, Andrew, Rigazzi, Alessandro, Ellis, Matthew, Becker, Stephen, Doostan, Alireza, Evans, John A., Jansen, Kenneth E.
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations. As simulations grow, however, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks, and performing inference at runtime requires non-trivial coupling of ML framework libraries with simulation codes. This work addresses both limitations by simplifying this coupling and enabling in situ training and inference workflows on heterogeneous clusters. Leveraging SmartSim, the presented framework deploys a database that stores data and ML models in memory, thus circumventing the file system. On the Polaris supercomputer, we demonstrate perfect scaling efficiency of the data transfer and inference costs up to the full machine size, thanks to a novel co-located deployment of the database. Moreover, we train an autoencoder in situ from a turbulent flow simulation, showing that the framework overhead is negligible relative to a solver time step and training epoch.
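
For flavor, here is a minimal sketch of the staging-and-inference pattern using the SmartRedis Python client that ships with SmartSim. It assumes an Orchestrator database is already running (e.g., launched co-located with the solver through a SmartSim Experiment); the tensor and model keys ("state", "encoder"), the array shape, and the TorchScript file name are illustrative placeholders, not the paper's workflow.

# Minimal sketch of in situ data staging and inference with the SmartRedis
# client. Assumes a SmartSim Orchestrator is already up; key names, shapes,
# and the TorchScript file "encoder.pt" are illustrative placeholders.
import numpy as np
from smartredis import Client

client = Client(cluster=False)  # co-located DB: the connection stays on-node

# Solver side: stage a flow snapshot in memory instead of writing files.
snapshot = np.random.rand(128, 128, 3).astype(np.float32)  # stand-in CFD state
client.put_tensor("state", snapshot)

# One-time setup (e.g., by the training component): upload a TorchScript model.
client.set_model_from_file("encoder", "encoder.pt", "TORCH", device="CPU")

# Runtime inference: execute the model inside the database, read the result.
client.run_model("encoder", inputs=["state"], outputs=["latent"])
latent = client.get_tensor("latent")
print(latent.shape)

Keeping both the tensors and the model resident in the in-memory database is what removes the file system from the loop: the solver and the training or inference processes exchange data through put/get calls rather than checkpoint files.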