Training Graph Neural Networks
Supplementary materials for the paper "Bandit Samplers for Training Graph Neural Networks"
We show the convergence on the validation set in terms of time (seconds) in Figure 1 and Figure 2. Our algorithms converge to much better results in nearly the same duration as the compared methods. Note that we cannot complete the training of AS-GAT on Reddit because of memory issues. Note also that timing comparisons between the "graph sampling" and "layer sampling" paradigms are not on an equal footing; as a result, we do not compare timing with "graph sampling" approaches. That is, graph sampling approaches are designed for graph data in which all vertices have labels. To summarize, the "layer sampling" approaches are more flexible and general than the "graph sampling" approaches. Before we give the proof of Theorem 1, we first prove the following Lemma 1, which will be used later.
Review for NeurIPS paper: Bandit Samplers for Training Graph Neural Networks
Weaknesses: The authors conduct experiments with a 2-layer architecture. However, some problems and datasets may require more complex architectures, and the authors could clarify how the proposed algorithm performs in terms of scalability and computation cost. The notation used in the paper can sometimes be confusing. For example, in Equation 4, alpha_ij is a value, not a function; alpha_ij(t) could be noted as a function of t. It is mentioned that the rewards vary as the training proceeds, and it would have been interesting to explore how simple bandit algorithms perform in the experiments, or how to apply simple bandit algorithms to the current setting. The authors could try to adapt a simple eps-greedy method to solve this problem. The algorithm is based on MAB, where a combinatorial set of neighbors with size k needs to be chosen.
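The eps-greedy baseline the reviewer suggests could be sketched as follows. This is purely illustrative: the function `eps_greedy_sample`, the reward dictionary, and the neighbor-id representation are assumptions for the sketch, not anything from the paper.

```python
import random

def eps_greedy_sample(neighbors, rewards, k, eps=0.1, rng=random):
    """Pick k of a node's neighbors: with probability eps explore a uniform
    random subset, otherwise exploit the top-k by running reward estimate.
    `rewards` maps neighbor id -> empirical reward (assumed maintained elsewhere)."""
    if rng.random() < eps:
        return rng.sample(neighbors, k)  # explore: uniform subset of size k
    # exploit: the k neighbors with the highest current reward estimates
    return sorted(neighbors, key=lambda v: rewards.get(v, 0.0), reverse=True)[:k]
```

As the reviewer notes, the combinatorial action space (subsets of size k) is what makes a naive per-arm eps-greedy scheme awkward here; this sketch sidesteps it by ranking arms independently.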
Review for NeurIPS paper: Bandit Samplers for Training Graph Neural Networks
There are several papers dealing with the issue of sampling when training a graph neural network, and this paper provides a new approach to it using (adversarial) bandit technology. This idea makes sense and the authors provide convincing arguments for why this technique should improve upon previous art. Although there may be room for slightly more comprehensive experiments and reasoning for the details of the proposed solution, the paper provides a clear representation of a novel method that works well for an important problem. This paper would make a good addition to NeurIPS.
Bandit Samplers for Training Graph Neural Networks
Several sampling algorithms with variance reduction have been proposed for accelerating the training of Graph Convolution Networks (GCNs). However, due to the intractable computation of the optimal sampling distribution, these sampling algorithms are suboptimal for GCNs and are not applicable to more general graph neural networks (GNNs) where the message aggregator contains learned weights rather than fixed weights, such as Graph Attention Networks (GAT). The fundamental reason is that the embeddings of the neighbors or learned weights involved in the optimal sampling distribution are \emph{changing} during training and \emph{not known a priori}, but only \emph{partially observed} when sampled, thus making the derivation of an optimal variance-reduced sampler non-trivial. In this paper, we formulate the optimization of the sampling variance as an adversarial bandit problem, where the rewards are related to the node embeddings and learned weights, and can vary constantly. Thus a good sampler needs to acquire variance information about more neighbors (exploration) while at the same time optimizing the immediate sampling variance (exploitation). We theoretically show that our algorithm asymptotically approaches the optimal variance within a factor of 3. We show the efficiency and effectiveness of our approach on multiple datasets.
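The adversarial-bandit view in the abstract can be illustrated with a minimal EXP3-style sampler over one node's neighbors: each neighbor is an arm, only the sampled arm's reward is observed, and weights are updated with importance-weighted estimates. This is a generic sketch of the technique; the class name `Exp3NeighborSampler`, the `gamma` parameter, and the reward scaling are assumptions for illustration, not the paper's exact algorithm.

```python
import math
import random

class Exp3NeighborSampler:
    """EXP3-style adversarial bandit over one node's neighbors (a sketch).
    Handles rewards that vary over training, as the abstract describes."""

    def __init__(self, num_neighbors, gamma=0.2):
        self.gamma = gamma              # exploration rate
        self.w = [1.0] * num_neighbors  # one weight per neighbor (arm)

    def probs(self):
        """Mixture of weight-proportional exploitation and uniform exploration."""
        total, n = sum(self.w), len(self.w)
        return [(1 - self.gamma) * wi / total + self.gamma / n for wi in self.w]

    def sample(self, rng=random):
        """Draw one neighbor index from the current distribution."""
        r, acc = rng.random(), 0.0
        p = self.probs()
        for i, pi in enumerate(p):
            acc += pi
            if r < acc:
                return i
        return len(p) - 1

    def update(self, arm, reward):
        """Importance-weighted update: only the sampled arm's reward is observed."""
        p = self.probs()[arm]
        self.w[arm] *= math.exp(self.gamma * (reward / p) / len(self.w))
```

Under repeated reward for one neighbor, its sampling probability rises while the `gamma`-floor keeps every neighbor explorable, matching the exploration/exploitation trade-off the abstract describes.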
Training Graph Neural Networks with 1000 Layers
Li, Guohao, Müller, Matthias, Ghanem, Bernard, Koltun, Vladlen
Deep graph neural networks (GNNs) have achieved excellent results on various tasks on increasingly large graph datasets with millions of nodes and edges. However, memory complexity has become a major obstacle when training deep GNNs for practical applications due to the immense number of nodes, edges, and intermediate activations. To improve the scalability of GNNs, prior works propose smart graph sampling or partitioning strategies to train GNNs with a smaller set of nodes or sub-graphs. In this work, we study reversible connections, group convolutions, weight tying, and equilibrium models to advance the memory and parameter efficiency of GNNs. We find that reversible connections in combination with deep network architectures enable the training of overparameterized GNNs that significantly outperform existing methods on multiple datasets. Our models RevGNN-Deep (1001 layers with 80 channels each) and RevGNN-Wide (448 layers with 224 channels each) were both trained on a single commodity GPU and achieve an ROC-AUC of $87.74 \pm 0.13$ and $88.24 \pm 0.15$ on the ogbn-proteins dataset. To the best of our knowledge, RevGNN-Deep is the deepest GNN in the literature by one order of magnitude. Please visit our project website https://www.deepgcns.org/arch/gnn1000 for more information.
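The reversible connections described above can be illustrated with a minimal sketch: activations are split channel-wise into two halves, and each block's inputs can be reconstructed exactly from its outputs, so intermediate activations need not be cached for backpropagation. The class `RevBlock` and the toy sub-networks below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

class RevBlock:
    """Reversible residual block in the style used by reversible GNNs (a sketch).
    f and g are arbitrary deterministic sub-networks (e.g., graph aggregations);
    the coupling y1 = x1 + f(x2), y2 = x2 + g(y1) is invertible by construction."""

    def __init__(self, f, g):
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Exact reconstruction of the inputs from the outputs:
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2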