RSN: Randomized Subspace Newton
We develop a randomized Newton method capable of solving learning problems with very high-dimensional feature spaces, a common setting in applications such as medical imaging, genomics and seismology. Our method leverages randomized sketching in a new way, by finding the Newton direction constrained to the space spanned by a random sketch. We develop a simple global linear convergence theory that holds for practically all sketching techniques, which gives practitioners the freedom to design custom sketching approaches suitable for particular applications. We perform numerical experiments that demonstrate the efficiency of our method compared to accelerated gradient descent and the full Newton method. Our method can be seen as a refinement and randomized extension of the results of Karimireddy, Stich, and Jaggi [18].
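To make the update concrete, here is a minimal sketch of one subspace Newton step in the spirit of the abstract: the Newton system is solved only within the range of a random sketch matrix S, so each iteration needs just sketch_dim Hessian-vector products and a small linear solve. The Gaussian sketch, the fixed step size, and the function names are illustrative assumptions, not the paper's exact algorithm.

import numpy as np

def rsn_step(x, grad, hess, sketch_dim, step_size=1.0, rng=None):
    """One illustrative randomized subspace Newton step.

    Minimizes the local quadratic model over the subspace range(S):
        min_l  grad^T (S l) + 0.5 * (S l)^T hess (S l),
    then moves x along S @ l.
    """
    rng = np.random.default_rng(rng)
    d = x.shape[0]
    S = rng.standard_normal((d, sketch_dim))      # random sketch (Gaussian here; any sketch works)
    HS = hess @ S                                 # only sketch_dim Hessian-vector products
    reduced_hess = S.T @ HS                       # small (sketch_dim, sketch_dim) system
    reduced_grad = S.T @ grad
    lam = np.linalg.lstsq(reduced_hess, -reduced_grad, rcond=None)[0]
    return x + step_size * (S @ lam)

For a quadratic f(x) = 0.5 x^T A x - b^T x, for example, one would pass grad = A @ x - b and hess = A; in a real solver the dense Hessian would typically be replaced by a Hessian-vector-product routine.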
Author Feedback
Dear reviewers, thank you for taking the time to review our paper. All issues raised are easy to address. We will incorporate all of your suggestions. First, they are simply different algorithms. We achieve this by entirely bypassing the theory of one-shot sketches, showing it is not at all necessary. We will include this discussion in the paper.
Insights on representational similarity in neural networks with canonical correlation
Ari Morcos, Maithra Raghu, Samy Bengio
Comparing different neural network representations and determining how representations evolve over time remain challenging open questions in our understanding of the function of neural networks. Comparing representations in neural networks is fundamentally difficult, as the structure of representations varies greatly, even across groups of networks trained on identical tasks and over the course of training. Here, we develop projection-weighted CCA (Canonical Correlation Analysis) as a tool for understanding neural networks, building on SVCCA, a recently proposed method [22]. We first improve the core method, showing how to differentiate between signal and noise, and then apply this technique to compare across a group of CNNs, demonstrating that networks which generalize converge to more similar representations than networks which memorize, that wider networks converge to more similar solutions than narrow networks, and that trained networks with identical topology but different learning rates converge to distinct clusters with diverse representations. We also investigate the representational dynamics of RNNs, across both training and sequential timesteps, finding that RNNs converge in a bottom-up pattern over the course of training and that the hidden state is highly variable over the course of a sequence, even when accounting for linear transforms. Together, these results provide new insights into the function of CNNs and RNNs, and demonstrate the utility of using CCA to understand representations.
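A compact sketch of the projection-weighted CCA score described above, for two activation matrices recorded over the same examples: canonical correlations come from an SVD of the product of orthonormal bases, and each correlation is then weighted by how much of the first representation it accounts for. This is an illustrative re-implementation under the usual CCA-via-QR-and-SVD formulation; the authors' released code differs in details such as SVD preprocessing of the activations and numerical safeguards, and the function name here is an assumption.

import numpy as np

def pwcca_similarity(X, Y, eps=1e-10):
    """Projection-weighted mean canonical correlation between activations.

    X: (n_examples, n_neurons_x), Y: (n_examples, n_neurons_y).
    """
    # Center each neuron's activations.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Orthonormal bases of the two activation spans; the singular values of
    # Qx^T Qy are the canonical correlations.
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    U, rho, _ = np.linalg.svd(Qx.T @ Qy)
    # Canonical variates of X (vectors over examples).
    H = Qx @ U
    # Weight each correlation by how much of X it accounts for (signal vs. noise).
    alpha = np.abs(H.T @ X).sum(axis=1)
    alpha = alpha / (alpha.sum() + eps)
    k = min(rho.shape[0], alpha.shape[0])
    return float((alpha[:k] * rho[:k]).sum())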
Unifying Homophily and Heterophily for Spectral Graph Neural Networks via Triple Filter Ensembles
Polynomial-based learnable spectral graph neural networks (GNNs) use polynomials to approximate graph convolutions and have achieved impressive performance on graphs. Nevertheless, three progressive problems remain to be solved. First, some models use polynomials with better approximation properties for their filters, yet perform worse on real-world graphs. Second, carefully crafted graph learning methods, sophisticated polynomial approximations, and refined coefficient constraints can lead to overfitting, which diminishes the generalization of the models. Third, how can one design a model that retains the ability of polynomial-based spectral GNNs to approximate filters while achieving higher generalization and performance? In this paper, we propose a spectral GNN with a triple filter ensemble (TFE-GNN), which adaptively extracts homophily and heterophily from graphs with different levels of homophily while utilizing the initial features.
A Projection-free Algorithm for Constrained Stochastic Multi-level Composition Optimization
We propose a projection-free conditional gradient-type algorithm for smooth stochastic multi-level composition optimization, where the objective function is a nested composition of T functions and the constraint set is a closed convex set. Our algorithm assumes access to noisy evaluations of the functions and their gradients, through a stochastic first-order oracle satisfying certain standard unbiasedness and second-moment assumptions.
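As a rough illustration of the projection-free idea, the sketch below runs a single-level stochastic conditional-gradient (Frank-Wolfe) loop: each iteration calls a linear minimization oracle over the constraint set instead of computing a projection. The nested gradient estimators for the T-level composition, the step-size schedule, and the l2-ball oracle shown here are simplifications and assumptions, not the paper's full algorithm.

import numpy as np

def stochastic_conditional_gradient(grad_oracle, lmo, x0, n_iters=100):
    """Projection-free loop: grad_oracle(x) returns a noisy gradient estimate,
    lmo(g) returns argmin_{v in C} <g, v> over the closed convex set C."""
    x = np.array(x0, dtype=float)
    for t in range(n_iters):
        g = grad_oracle(x)
        v = lmo(g)                       # extreme point of the constraint set
        gamma = 2.0 / (t + 2.0)          # classic conditional-gradient step size
        x = (1.0 - gamma) * x + gamma * v
    return x

def l2_ball_lmo(g, radius=1.0):
    """Linear minimization oracle for an l2 ball of the given radius."""
    norm = np.linalg.norm(g)
    return -radius * g / norm if norm > 0 else np.zeros_like(g)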
Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
Per-example gradient norms are a vital ingredient for estimating the gradient noise scale (GNS) with minimal variance. By examining the tensor contractions required to compute them, we propose a method with minimal FLOPs in regimes of three or more tensor dimensions that computes the norms simultaneously with the parameter gradients. Using this method, we are able to observe the GNS of different layers at higher accuracy than previously possible. We find that the total GNS of contemporary transformer models is predicted well by the GNS of only the normalization layers. Focusing on the normalization layers, we therefore develop a custom kernel that computes the per-example gradient norms while performing the LayerNorm backward pass with zero throughput overhead. Tracking GNS on only those layers, we are able to guide a practical batch-size schedule that reduces training time by 18% on a Chinchilla-optimal language model.
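To illustrate the kind of contraction involved, the sketch below computes squared per-example gradient norms for a linear layer's weight from the layer input and output gradient, without materializing per-example gradients, and then forms the standard small-batch/large-batch GNS estimate. The paper's fused kernel targets the LayerNorm parameters specifically, so treat the function names, shapes, and the choice of layer here as illustrative assumptions rather than the authors' implementation.

import torch

def per_example_weight_grad_sq_norms(x, grad_out):
    """Squared per-example gradient norms for a linear layer's weight.

    x: (batch, seq, d_in) layer input; grad_out: (batch, seq, d_out) gradient
    w.r.t. the layer output. The per-example weight gradient is
    grad_out[b].T @ x[b], whose squared Frobenius norm contracts as below.
    """
    gg = torch.einsum("bsd,btd->bst", grad_out, grad_out)   # (batch, seq, seq)
    xx = torch.einsum("bsd,btd->bst", x, x)                 # (batch, seq, seq)
    return (gg * xx).sum(dim=(1, 2))                        # (batch,)

def gradient_noise_scale(per_example_sq_norms, batch_grad_sq_norm):
    """GNS = tr(Sigma) / ||G||^2, estimated from gradient norms at batch
    sizes 1 and B. batch_grad_sq_norm is the squared norm of the mean
    gradient over the same B examples (the usual unbiased estimator)."""
    B = per_example_sq_norms.shape[0]
    small = per_example_sq_norms.mean()       # E ||g_i||^2 at batch size 1
    big = batch_grad_sq_norm                  # ||g_B||^2 at batch size B
    g_sq = (B * big - small) / (B - 1)        # unbiased estimate of ||G||^2
    trace_sigma = B * (small - big) / (B - 1) # unbiased estimate of tr(Sigma)
    return trace_sigma / g_sq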