wider network
The Effect of Network Width on the Performance of Large-batch Training
Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, it can hurt both the convergence of the algorithm and the generalization performance. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that, for a fixed number of parameters, wider networks are more amenable to fast large-batch training than deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.
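The comparison "for a fixed number of parameters" can be made concrete with a small parameter-counting sketch. This is only an illustration, not the authors' experimental setup: the helper names `mlp_param_count` and `width_for_budget`, the input and output sizes (784 and 10), and the layer counts are all assumptions chosen for the example.

```python
def mlp_param_count(input_dim, hidden_width, num_hidden_layers, output_dim):
    """Weights plus biases of a fully-connected network with equal-width hidden layers."""
    dims = [input_dim] + [hidden_width] * num_hidden_layers + [output_dim]
    # Each layer has d_in * d_out weights and d_out biases.
    return sum((d_in + 1) * d_out for d_in, d_out in zip(dims, dims[1:]))


def width_for_budget(budget, input_dim, num_hidden_layers, output_dim):
    """Largest hidden width whose parameter count does not exceed the budget."""
    if mlp_param_count(input_dim, 1, num_hidden_layers, output_dim) > budget:
        raise ValueError("budget too small for even a width-1 network")
    width = 1
    while mlp_param_count(input_dim, width + 1, num_hidden_layers, output_dim) <= budget:
        width += 1
    return width


# Hypothetical "wide" reference network: 2 hidden layers of width 1024.
wide_params = mlp_param_count(784, 1024, 2, 10)

# Hypothetical "deep" counterpart: 8 hidden layers, width chosen to fit the same budget.
deep_width = width_for_budget(wide_params, 784, 8, 10)
deep_params = mlp_param_count(784, deep_width, 8, 10)

print(f"wide: 2 layers x 1024 units, {wide_params} parameters")
print(f"deep: 8 layers x {deep_width} units, {deep_params} parameters")
```

Matching the budget this way gives a wide and a deep network of near-equal size, so any difference in large-batch behaviour can be attributed to shape rather than to parameter count.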
Do Wider Neural Networks Really Help Adversarial Robustness?
Adversarial training is a powerful type of defense against adversarial examples. Previous empirical results suggest that adversarial training requires wider networks for better performance. However, it remains unclear how neural network width affects model robustness. In this paper, we carefully examine the relationship between network width and model robustness. Specifically, we show that model robustness is closely related to the tradeoff between natural accuracy and perturbation stability, which is controlled by the robust regularization parameter $\lambda$.
reviewers' questions below and will incorporate feedback into the final revision
We thank the reviewers for the detailed and insightful reviews. As the reviewers noted, our work 1) contributes to "a Thank you for the valuable feedback on this section -- we will incorporate this in our next revision. The intuition for the proof of Theorem 3.3 is that the optimization problem is convex over the space of probability By weak regularization, we refer to the fact that $\lambda \to 0$ for our Theorem 4.1 to hold. The difficulty with ReLU networks is that if the gradient flow pushes neurons towards 0, issues of differentiability arise. One potential approach to circumvent this issue is arguing that with correct initialization, the iterates will never reach 0. This is an interesting direction for future work and we thank the reviewer for this suggestion.
$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers
Thérien, Benjamin, Joseph, Charles-Étienne, Knyazev, Boris, Oyallon, Edouard, Rish, Irina, Belilovsky, Eugene
Learned optimizers (LOs) can significantly reduce the wall-clock training time of neural networks, substantially reducing training costs. However, they often suffer from poor meta-generalization, especially when training networks larger than those seen during meta-training. To address this, we use the recently proposed Maximal Update Parametrization ($\mu$P), which allows zero-shot generalization of optimizer hyperparameters from smaller to larger models. We extend $\mu$P theory to learned optimizers, treating the meta-training problem as finding the learned optimizer under $\mu$P. Our evaluation shows that LOs meta-trained with $\mu$P substantially improve meta-generalization as compared to LOs trained under standard parametrization (SP). Notably, when applied to large-width models, our best $\mu$LO, trained for 103 GPU-hours, matches or exceeds the performance of VeLO, the largest publicly available learned optimizer, meta-trained with 4000 TPU-months of compute. Moreover, $\mu$LOs demonstrate better generalization than their SP counterparts to deeper networks and to much longer training horizons (25 times longer) than those seen during meta-training.
A Neural Scaling Law from Lottery Ticket Ensembling
Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharma & Kaplan analyzed NSL using approximation theory and predicted that MSE losses decay as $N^{-\alpha}$, $\alpha=4/d$, where $N$ is the number of model parameters and $d$ is the intrinsic input dimension. Although their theory works well for some cases (e.g., ReLU networks), we surprisingly find that a simple 1D problem $y=x^2$ manifests a different scaling law ($\alpha=1$) from their prediction ($\alpha=4$). We opened up the neural networks and found that the new scaling law originates from lottery ticket ensembling: a wider network on average has more "lottery tickets", which are ensembled to reduce the variance of outputs. We support the ensembling mechanism by mechanistically interpreting single neural networks, as well as studying them statistically. We attribute the $N^{-1}$ scaling law to the "central limit theorem" of lottery tickets. Finally, we discuss its potential implications for large language models and statistical physics-type theories of learning.
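The "central limit theorem" argument can be illustrated with a toy simulation. This is a sketch under strong assumptions, not the paper's experiment: each "lottery ticket" is modelled as an independent, unbiased, noisy predictor of the same target, and the number of tickets is taken to grow in proportion to $N$. The function names `ensemble_mse` and `fit_log_log_slope` are hypothetical. Under these assumptions the mean squared error of the averaged prediction should fall roughly as $N^{-1}$, so a log-log fit should give a slope close to $-1$.

```python
import math
import random
import statistics


def ensemble_mse(n_tickets, n_trials=2000, noise_std=1.0, seed=0):
    """Monte Carlo MSE of the average of n_tickets independent noisy predictors.

    The true target is 0, so the squared average is the squared error.
    """
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(n_trials):
        avg = sum(rng.gauss(0.0, noise_std) for _ in range(n_tickets)) / n_tickets
        squared_errors.append(avg ** 2)
    return statistics.fmean(squared_errors)


def fit_log_log_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x, mean_y = statistics.fmean(log_x), statistics.fmean(log_y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    den = sum((a - mean_x) ** 2 for a in log_x)
    return num / den


# Assumed proxy for width: the number of ensembled tickets.
ticket_counts = [1, 2, 4, 8, 16, 32, 64]
losses = [ensemble_mse(n, seed=n) for n in ticket_counts]
slope = fit_log_log_slope(ticket_counts, losses)
print(f"fitted scaling exponent: {slope:.2f}")
```

The fitted slope is an estimate of $-\alpha$, so a value near $-1$ corresponds to the $\alpha=1$ law described above rather than the $\alpha=4/d$ prediction.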
Wide Neural Networks Forget Less Catastrophically
Mirzadeh, Seyed Iman, Chaudhry, Arslan, Hu, Huiyi, Pascanu, Razvan, Gorur, Dilan, Farajtabar, Mehrdad
Machine learning is relying more and more on training large models on large static datasets to reach impressive results (Kaplan et al., 2020; Lazaridou et al., 2021; Hombaiah et al., 2021). However, the real world is changing over time and new information is becoming available at an unprecedented rate (Lazaridou et al., 2021; Hombaiah et al., 2021). In such real-world problems, the learning agent is exposed to a continuous stream of data, with a potentially changing data distribution, and it has to absorb new information efficiently while not being able to iterate on previous data as freely as wanted due to time, sample, compute, privacy, or environmental complexity issues (Parisi et al., 2018). To overcome these inefficiencies, fields such as continual learning (CL) (Ring et al., 1994) or lifelong learning (Thrun, 1995) have recently been gaining a lot of attention. One of the key challenges in continual learning models is the abrupt erasure of previous knowledge, referred to as Catastrophic Forgetting (CF) (McCloskey and Cohen, 1989). Alleviating catastrophic forgetting has attracted a lot of attention lately, and many interesting solutions have been proposed to partly overcome the issue (e.g., Toneva et al., 2018; Nguyen et al., 2019; Hsu et al., 2018; Li et al., 2019; Wallingford et al., 2020). These solutions vary in complexity from simple replay-based methods to complicated regularization or network expansion-based methods. Unfortunately, however, there is not much fundamental understanding of the intrinsic properties of neural networks that affect continual learning performance through catastrophic forgetting or forward/backward transfer (Mirzadeh et al., 2020). Work done during an internship at DeepMind.
The Effect of Network Width on the Performance of Large-batch Training
Chen, Lingjiao, Wang, Hongyi, Zhao, Jinman, Papailiopoulos, Dimitris, Koutris, Paraschos
Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, it can hurt both the convergence of the algorithm and the generalization performance. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that, for a fixed number of parameters, wider networks are more amenable to fast large-batch training than deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.