MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs
Aishwarya Sarkar, Sayan Ghosh, Nathan R. Tallent, Ali Jannesari
Graph Neural Networks (GNNs) are indispensable for learning from graph-structured data, yet their rising computational cost, especially on massively connected graphs, poses significant challenges to execution performance. To tackle this, distributed-memory solutions such as partitioning the graph and concurrently training multiple GNN replicas are in common use. However, approaches that require a partitioned graph usually suffer from communication overhead and load imbalance, even under optimal partitioning and communication strategies, owing to irregularities in neighborhood minibatch sampling. This paper proposes practical trade-offs for reducing the sampling and communication overheads of representation learning on distributed graphs (using the popular GraphSAGE architecture) by developing a parameterized continuous prefetch and eviction scheme on top of the state-of-the-art Amazon DistDGL distributed GNN framework, demonstrating about a 15-40% improvement in end-to-end training performance on the National Energy Research Scientific Computing Center's (NERSC) Perlmutter supercomputer across various OGB datasets.
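To make the idea of a parameterized prefetch-and-eviction scheme concrete, here is a minimal Python sketch of a local cache for remote ("halo") node features in a partitioned graph. It is an illustration only, not the MassiveGNN or DistDGL implementation; the class name, the `fetch_remote` callable, and the `capacity`/`alpha` parameters are all hypothetical stand-ins for whatever RPC and tuning knobs the actual system uses.

```python
# Hedged sketch: a parameterized prefetch-and-eviction cache for remote
# ("halo") node features in distributed minibatch sampling.
# All names here are hypothetical; `fetch_remote` stands in for the RPC
# that pulls features from the partition that owns the nodes.

import numpy as np


class HaloFeatureCache:
    def __init__(self, capacity, alpha=0.5):
        self.capacity = capacity   # max number of remote nodes kept locally
        self.alpha = alpha         # decay parameter steering eviction
        self.features = {}         # node id -> feature vector
        self.score = {}            # node id -> access score

    def lookup(self, node_ids, fetch_remote):
        """Return features for node_ids, pulling misses via fetch_remote."""
        misses = [n for n in node_ids if n not in self.features]
        if misses:
            for n, f in zip(misses, fetch_remote(misses)):
                self.features[n] = f
        out = []
        for n in node_ids:
            self.score[n] = self.score.get(n, 0.0) + 1.0
            out.append(self.features[n])
        self._evict()
        return np.stack(out)

    def prefetch(self, likely_ids, fetch_remote):
        """Eagerly pull features expected in upcoming minibatches."""
        new = [n for n in likely_ids if n not in self.features]
        if new:
            for n, f in zip(new, fetch_remote(new)):
                self.features[n] = f
                self.score.setdefault(n, 0.0)
        self._evict()

    def _evict(self):
        """Decay scores and drop the lowest-scoring nodes once over capacity."""
        for n in self.score:
            self.score[n] *= self.alpha
        while len(self.features) > self.capacity:
            victim = min(self.features, key=lambda n: self.score.get(n, 0.0))
            del self.features[victim]
            self.score.pop(victim, None)
```

In such a scheme, `prefetch` would overlap communication with computation between minibatches, while the score decay (`alpha`) and `capacity` trade cache hit rate against memory, which is the kind of tunable trade-off the abstract refers to.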
arXiv.org Artificial Intelligence
Nov-3-2024
- Country:
- North America > United States > Iowa (0.14)
- Genre:
- Research Report (1.00)
- Industry:
- Energy (0.34)
- Government (0.46)