

Train-by-Reconnect: Decoupling Locations of Weights from Their Values

Neural Information Processing Systems

What makes untrained deep neural networks (DNNs) different from the trained performant ones? By zooming into the weights in well-trained DNNs, we found that it is the location of weights that holds most of the information encoded by the training. Motivated by this observation, we hypothesized that weights in DNNs trained using stochastic gradient-based methods can be separated into two dimensions: the location of weights, and their exact values. To assess our hypothesis, we propose a novel method called lookahead permutation (LaPerm) to train DNNs by reconnecting the weights. We empirically demonstrate LaPerm's versatility while producing extensive evidence to support our hypothesis: when the initial weights are random and dense, our method demonstrates speed and performance similar to or better than that of regular optimizers, e.g., Adam. When the initial weights are random and sparse (many zeros), our method changes the way neurons connect, achieving accuracy comparable to that of a well-trained dense network. When the initial weights share a single value, our method finds a weight agnostic neural network with far-better-than-chance accuracy.
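The abstract does not spell out how the permutation step works, but the idea of "reconnecting" a fixed set of weight values can be illustrated with a minimal NumPy sketch. This is only an assumption-laden illustration, not the authors' implementation: it supposes that, after some number of inner-optimizer steps, the frozen multiset of initial values is reassigned to positions according to the rank order of the inner-optimizer weights, so only the locations of the values change.

```python
import numpy as np

def permute_to_match(initial_w, trained_w):
    """Reassign the fixed initial values to positions ranked like trained_w.

    The value multiset of initial_w is preserved; only the locations
    (the "connections") change, following the rank order of trained_w.
    (Sketch of a possible reconnection step, not the paper's exact LaPerm code.)
    """
    flat_init = np.sort(initial_w.ravel())       # the only values we may use
    order = np.argsort(trained_w.ravel())        # slots, from smallest to largest
    permuted = np.empty_like(flat_init)
    permuted[order] = flat_init                  # i-th smallest value -> i-th smallest slot
    return permuted.reshape(initial_w.shape)

# Toy usage: a hypothetical layer drifts for a few inner-optimizer steps,
# then is projected back onto a permutation of its initial values.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(4, 3))                     # frozen initial values
w = w0 + 0.1 * rng.normal(size=w0.shape)         # stand-in for k optimizer steps
w_synced = permute_to_match(w0, w)
assert np.allclose(np.sort(w_synced.ravel()), np.sort(w0.ravel()))  # same values, new locations
```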


Review for NeurIPS paper: Train-by-Reconnect: Decoupling Locations of Weights from Their Values

Neural Information Processing Systems

Weaknesses: * I am not sure how novel or meaningful the analysis of "weight profiles" is in Section 2. Checking the provided code, the weight profiles in Figures 1 and 2 are plotted for the weights of an ImageNet-pretrained model obtained as: vgg16 = tf.keras.applications.vgg16.VGG16(include_top=True, weights="imagenet"). It would be important to know what hyperparameters were used in the training script for the pre-trained models. It is likely that the weight initialization was Gaussian and that weight decay was used for regularization; in that case the distribution of weights in the trained model may not differ greatly from the initial distribution (e.g., still roughly Gaussian). One can obtain similar plots to Figure 2 by sorting random Gaussian samples, e.g., samples = np.random.normal(size=...). Alternatively, there are many distributions other than Gaussians that could potentially yield heavy-tailed plots similar to Figures 1 and 2. A relevant paper looking at the distributions of trained network weights is [1].
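As a rough illustration of the reviewer's point, sorting i.i.d. Gaussian samples already produces a smooth, heavy-tailed-looking "profile" curve. The sketch below assumes nothing about the paper's actual figure settings (profile lengths and styling are made up here); it only shows that sorted Gaussian draws give curves of the same general shape.

```python
# Sketch: sorted i.i.d. Gaussian samples as stand-in "weight profiles".
# Profile lengths are hypothetical; the paper's figure settings are not known.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
for n in (64, 512, 4096):                        # hypothetical profile lengths
    samples = rng.normal(size=n)
    plt.plot(np.linspace(0, 1, n), np.sort(samples), label=f"sorted N(0,1), n={n}")
plt.xlabel("normalized rank")
plt.ylabel("value")
plt.legend()
plt.show()
```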

