AITopics | wider network

3937230de3c8041e4da6ac3246a888e8-Paper.pdf

Neural Information Processing Systems | Apr-25-2026, 12:05:09 GMT

artificial intelligence, machine learning, robustness, (20 more...)

Neural Information Processing Systems

Genre: Research Report (0.68)

Industry: Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)


The Effect of Network Width on the Performance of Large-batch Training

Neural Information Processing Systems | Mar-17-2026, 02:04:59 GMT

Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, it can hurt both the convergence of the algorithm and its generalization performance. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that, for a fixed number of parameters, wider networks are more amenable to fast large-batch training than deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.
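To make the "fixed number of parameters" comparison concrete, here is a minimal sketch of how width trades against depth in a plain fully-connected network. The function names, the 784-dimensional input, the 10 outputs, and the one-million-parameter budget are illustrative assumptions and are not taken from the paper.

```python
def mlp_param_count(input_dim, hidden_width, hidden_layers, output_dim):
    """Count the weights and biases of a fully-connected network.

    Assumes every hidden layer has the same width (an illustrative choice).
    """
    dims = [input_dim] + [hidden_width] * hidden_layers + [output_dim]
    # Each layer contributes fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(dims, dims[1:]))


def widest_within_budget(budget, input_dim, hidden_layers, output_dim):
    """Largest hidden width whose parameter count stays within the budget.

    Assumes the budget is large enough for a width-1 network.
    """
    width = 1
    while mlp_param_count(input_dim, width + 1, hidden_layers, output_dim) <= budget:
        width += 1
    return width


if __name__ == "__main__":
    budget = 1_000_000  # hypothetical parameter budget
    for layers in (1, 2, 4, 8):
        width = widest_within_budget(budget, 784, layers, 10)
        total = mlp_param_count(784, width, layers, 10)
        print(f"hidden layers={layers}  width={width}  parameters={total}")
```

Under a fixed budget, each extra hidden layer forces the remaining layers to be narrower, which is the sense in which "wider" and "deeper" variants are compared at equal parameter count.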

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.99)


Do Wider Neural Networks Really Help Adversarial Robustness?

Neural Information Processing Systems | Dec-24-2025, 00:11:21 GMT

Adversarial training is a powerful type of defense against adversarial examples. Previous empirical results suggest that adversarial training requires wider networks for better performance. However, it remains unclear how neural network width affects model robustness. In this paper, we carefully examine the relationship between network width and model robustness. Specifically, we show that model robustness is closely related to the tradeoff between natural accuracy and perturbation stability, which is controlled by the robust regularization parameter λ.
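The abstract does not give the training objective. As an assumption, one common way to write a robust-regularized objective of this kind (in the style of TRADES-like methods) is shown below; the symbols $f_\theta$, $\ell$, $D$, and $\epsilon$ are illustrative and not taken from the abstract.

```latex
\min_{\theta} \; \mathbb{E}_{(x,y)} \Big[
    \underbrace{\ell\big(f_\theta(x),\, y\big)}_{\text{natural accuracy}}
    \;+\; \lambda \,
    \underbrace{\max_{\|x' - x\| \le \epsilon} D\big(f_\theta(x),\, f_\theta(x')\big)}_{\text{perturbation stability}}
\Big]
```

In a form like this, a larger λ puts more weight on keeping predictions stable under perturbation, and a smaller λ puts more weight on fitting the clean labels.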

help adversarial robustness, model robustness, perturbation stability, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.30)


reviewers' questions below and will incorporate feedback into the final revision

Neural Information Processing Systems | Oct-3-2025, 03:54:26 GMT

We thank the reviewers for the detailed and insightful reviews. As the reviewers noted, our work 1) contributes to "a [...] Thank you for the valuable feedback on this section; we will incorporate this in our next revision. The intuition for the proof of Theorem 3.3 is that the optimization problem is convex over the space of probability [...] By weak regularization, we refer to the fact that λ → 0 for our Theorem 4.1 to hold. The difficulty with ReLU networks is that if the gradient flow pushes neurons towards 0, issues of differentiability arise. One potential approach to circumvent this issue is to argue that, with correct initialization, the iterates will never reach 0. This is an interesting direction for future work and we thank the reviewer for this suggestion.

artificial intelligence, machine learning, reviewer, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.32)


$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

Thérien, Benjamin, Joseph, Charles-Étienne, Knyazev, Boris, Oyallon, Edouard, Rish, Irina, Belilovsky, Eugene

arXiv.org Artificial Intelligence | May-31-2024

Learned optimizers (LOs) can significantly reduce the wall-clock training time of neural networks, substantially reducing training costs. However, they often suffer from poor meta-generalization, especially when training networks larger than those seen during meta-training. To address this, we use the recently proposed Maximal Update Parametrization ($\mu$P), which allows zero-shot generalization of optimizer hyperparameters from smaller to larger models. We extend $\mu$P theory to learned optimizers, treating the meta-training problem as finding the learned optimizer under $\mu$P. Our evaluation shows that LOs meta-trained with $\mu$P substantially improve meta-generalization as compared to LOs trained under standard parametrization (SP). Notably, when applied to large-width models, our best $\mu$LO, trained for 103 GPU-hours, matches or exceeds the performance of VeLO, the largest publicly available learned optimizer, meta-trained with 4000 TPU-months of compute. Moreover, $\mu$LOs demonstrate better generalization than their SP counterparts to deeper networks and to much longer training horizons (25 times longer) than those seen during meta-training.

generalization, mlp, optimizer, (14 more...)

arXiv.org Artificial Intelligence

2406.00153

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > Canada > Quebec > Montreal (0.04)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)


A Neural Scaling Law from Lottery Ticket Ensembling

Liu, Ziming, Tegmark, Max

arXiv.org Machine Learning | Oct-3-2023

Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharma & Kaplan analyzed NSL using approximation theory and predict that MSE losses decay as $N^{-\alpha}$, $\alpha=4/d$, where $N$ is the number of model parameters, and $d$ is the intrinsic input dimension. Although their theory works well for some cases (e.g., ReLU networks), we surprisingly find that a simple 1D problem $y=x^2$ manifests a different scaling law ($\alpha=1$) from their predictions ($\alpha=4$). We opened the neural networks and found that the new scaling law originates from lottery ticket ensembling: a wider network on average has more "lottery tickets", which are ensembled to reduce the variance of outputs. We support the ensembling mechanism by mechanistically interpreting single neural networks, as well as studying them statistically. We attribute the $N^{-1}$ scaling law to the "central limit theorem" of lottery tickets. Finally, we discuss its potential implications for large language models and statistical physics-type theories of learning.
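A scaling exponent such as $\alpha$ in $N^{-\alpha}$ is typically read off as the slope of a straight line on a log-log plot. The sketch below shows that step with an ordinary least-squares fit. The function name and the synthetic loss values are assumptions made for illustration; they are not measurements from the paper.

```python
import math


def fit_scaling_exponent(param_counts, losses):
    """Estimate alpha in loss ~ C * N**(-alpha) by least squares in log-log space."""
    xs = [math.log(n) for n in param_counts]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    # The fitted slope is -alpha, so flip the sign.
    return -covariance / variance


if __name__ == "__main__":
    sizes = [10, 30, 100, 300, 1000]  # hypothetical parameter counts N
    # Synthetic losses following an exact power law, used only to exercise the fit.
    observed_like = [3.0 * n ** -1.0 for n in sizes]   # alpha = 1 case
    predicted_like = [3.0 * n ** -4.0 for n in sizes]  # alpha = 4/d with d = 1
    print(fit_scaling_exponent(sizes, observed_like))
    print(fit_scaling_exponent(sizes, predicted_like))
```

On real measurements the points will not lie exactly on a line, so the fitted value depends on which range of $N$ is included in the fit.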

lottery ticket, machine learning, natural language, (16 more...)

arXiv.org Machine Learning

2310.02258

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Contests & Prizes (1.00)

Industry: Leisure & Entertainment > Gambling (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)


Wide Neural Networks Forget Less Catastrophically

Mirzadeh, Seyed Iman, Chaudhry, Arslan, Hu, Huiyi, Pascanu, Razvan, Gorur, Dilan, Farajtabar, Mehrdad

arXiv.org Artificial Intelligence | Oct-21-2021

Machine learning increasingly relies on training large models on large static datasets to reach impressive results (Kaplan et al., 2020; Lazaridou et al., 2021; Hombaiah et al., 2021). However, the real world changes over time, and new information becomes available at an unprecedented rate (Lazaridou et al., 2021; Hombaiah et al., 2021). In such real-world problems, the learning agent is exposed to a continuous stream of data with a potentially changing distribution, and it has to absorb new information efficiently without being able to iterate on previous data as freely as it would like, owing to time, sample, compute, privacy, or environmental complexity issues (Parisi et al., 2018). To address these constraints, fields such as continual learning (CL) (Ring et al., 1994) and lifelong learning (Thrun, 1995) have recently gained a lot of attention. One of the key challenges in continual learning is the abrupt erasure of previous knowledge, referred to as catastrophic forgetting (CF) (McCloskey and Cohen, 1989). Alleviating catastrophic forgetting has attracted a lot of attention lately, and many interesting solutions have been proposed to partly overcome the issue (e.g., Toneva et al., 2018; Nguyen et al., 2019; Hsu et al., 2018; Li et al., 2019; Wallingford et al., 2020). These solutions vary in complexity from simple replay-based methods to complicated regularization or network expansion-based methods. Unfortunately, there is not much fundamental understanding of the intrinsic properties of neural networks that affect continual learning performance through catastrophic forgetting or forward/backward transfer (Mirzadeh et al., 2020). Work done during an internship at DeepMind.

continual learning, gradient, neural network, (11 more...)

arXiv.org Artificial Intelligence

2110.11526

Country: North America > United States > Washington (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Education (0.89)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)


The Effect of Network Width on the Performance of Large-batch Training

Chen, Lingjiao, Wang, Hongyi, Zhao, Jinman, Papailiopoulos, Dimitris, Koutris, Paraschos

Neural Information Processing Systems | Feb-14-2020, 20:42:14 GMT

Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, it can hurt both the convergence of the algorithm and its generalization performance. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that, for a fixed number of parameters, wider networks are more amenable to fast large-batch training than deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.

large-batch training, neural network, wider network, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
