Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications

Afzal, Ayesha, Hager, Georg, Wellein, Gerhard, Markidis, Stefano

arXiv.org Artificial Intelligence 

This paper studies the utility of using data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with the regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time per time step as relevant observables. Using principal component analysis, clustering techniques, correlation functions, and a new "phase space plot," we show how desynchronization patterns (or lack thereof) can be readily identified from a data set that is much smaller than a full MPI trace. Our methods also lead the way towards a more general classification of parallel program dynamics.