
Asymptotic normality and confidence intervals for derivatives of 2-layers neural network in the random features model

Neural Information Processing Systems

This paper studies two-layer neural networks (NNs) in which the first layer contains random weights and the second layer is trained with Ridge regularization. This model has been the focus of numerous recent works showing that, despite its simplicity, it captures some of the empirically observed behaviors of NNs in the overparametrized regime, such as the double-descent curve, where the generalization error decreases as the number of weights increases to $+\infty$. This paper establishes asymptotic distribution results for this two-layer NN model in the regime where the ratios $\frac p n$ and $\frac d n$ have finite limits, where $n$ is the sample size, $p$ the ambient dimension, and $d$ the width of the first layer. We show that a weighted average of the derivatives of the trained NN at the observed data is asymptotically normal, in a setting with Lipschitz activation functions and a linear regression response with Gaussian features under possibly non-linear perturbations. We then leverage this asymptotic normality result to construct confidence intervals (CIs) for single components of the unknown regression vector. The novelty of our results is threefold: (1) despite the nonlinearity induced by the activation function, we characterize the asymptotic distribution of a weighted average of the gradients of the network after training; (2) we provide the first frequentist uncertainty-quantification guarantees, in the form of valid $(1\text{-}\alpha)$-CIs, based on NN estimates; (3) we show that the double-descent phenomenon occurs in the length of the CIs, with the length increasing and then decreasing as $\frac d n\nearrow +\infty$ for certain fixed values of $\frac p n$. We also provide a toolbox to predict the length of CIs numerically, which lets us compare activation functions and other parameters in terms of CI length.
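The model described in the abstract can be sketched in a few lines. The following is my own minimal illustration (not the authors' code), assuming Gaussian features, a tanh activation, and illustrative sizes `n`, `p`, `d` and Ridge parameter `lam`: the first-layer weights `W` are random and frozen, and only the second layer is fit by Ridge regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d, lam = 200, 50, 100, 1e-2              # illustrative sizes

X = rng.standard_normal((n, p))                 # Gaussian features
beta = rng.standard_normal(p) / np.sqrt(p)      # unknown regression vector
y = X @ beta + 0.1 * rng.standard_normal(n)     # linear response + noise

W = rng.standard_normal((d, p)) / np.sqrt(p)    # random first layer (frozen)
Z = np.tanh(X @ W.T)                            # Lipschitz activation
# Ridge-trained second layer: a_hat = (Z'Z + lam*I)^{-1} Z'y
a_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

def f(x):
    """Trained network f(x) = a_hat' * sigma(W x)."""
    return np.tanh(x @ W.T) @ a_hat

def grad_f(x):
    """Gradient of f at a single point x, the object whose weighted
    average over the data is shown to be asymptotically normal."""
    s = 1.0 - np.tanh(W @ x) ** 2               # sigma'(W x) for tanh
    return W.T @ (a_hat * s)
```

The gradients `grad_f(x_i)` over the observed data are the quantities averaged in the paper's central limit theorem; the asymptotic regime keeps `p/n` and `d/n` bounded.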


A More Analysis

Neural Information Processing Systems

This section describes how the objective for the encoder, model, and policy (Eq. The remaining difference between this objective and Eq. 5 is that the Q value term is scaled by This prior cannot be predicted from prior observations. Maximum entropy (MaxEnt) RL is a special case of our compression objective. In practice we perform gradient steps using the Adam [24] optimizer. An optimal agent must balance these information costs against the value of information gained from these observations.



Review for NeurIPS paper: Asymptotic normality and confidence intervals for derivatives of 2-layers neural network in the random features model

Neural Information Processing Systems

The reviewers point out that this is a borderline submission. They reasonably question several things in the paper:
- it is not clear why the coefficients for which the CLT holds are important;
- the assumptions are restrictive;
- the paper studies too simplistic a model;
- parts of the analysis are unclear;
- the writing is hasty, with typos lingering.
After my own reading, I agree with these comments. On the other hand, the reviewers also point out that certain aspects of double descent have not been explored previously, and these are of more interest than the confidence intervals. My opinion is that the paper would be much stronger if these concerns were addressed in a revised manuscript.


Review for NeurIPS paper: Asymptotic normality and confidence intervals for derivatives of 2-layers neural network in the random features model

Neural Information Processing Systems

Additional Feedback: I will increase my score if my concerns are addressed and if the authors can correct my potential misunderstanding.
1. I find the "double descent" phenomenon in the CI length to be interesting. Intuitively, the uncertainty of the model could relate to the variance of the prediction, which we know might blow up at the interpolation threshold due to the variance from label noise or from initialization. Can the authors comment on the plausible mechanism of this observation?
2. In this case, what would be the motivation for considering a nonlinear perturbation, which would basically be adding noise?
3. The result in Section 2.4 (based on Mei and Montanari 2019) seems to be under the assumption of an iid weight matrix W. I might have missed something, but is there a place where the authors discuss that this characterization also holds for arbitrary W (independent of X) with bounded spectral norm?
4. (minor) Does the characterization also hold in the ridgeless limit (\lambda \to 0)?
5. (minor) In Figure 2 (left), why is there a discrepancy between the predicted and simulated boxplots?
6. (minor) Although this is not the motivation of the work, the mentioned connection between NNs and the RF model typically requires significant overparameterization, so the current proportional scaling of n and d might not be the right setup.



Proving the Lottery Ticket Hypothesis: Pruning is All You Need

Malach, Eran, Yehudai, Gilad, Shalev-Shwartz, Shai, Shamir, Ohad

arXiv.org Machine Learning

Neural network pruning is a popular method to reduce the size of a trained model, allowing efficient computation during inference time with minimal loss in accuracy. However, such a method still requires training an over-parameterized network, as training a pruned network from scratch seems to fail (see [10]). Recently, a work by Frankle and Carbin [10] presented a surprising phenomenon: pruned neural networks can be trained to achieve good performance when resetting their weights to their initial values. Hence, the authors state the lottery ticket hypothesis: a randomly-initialized neural network contains a subnetwork that, when trained in isolation, can match the performance of the original network. This observation has attracted great interest, with various follow-up works trying to understand this intriguing phenomenon. Specifically, very recent works by Zhou et al. [37] and Ramanujan et al. [27] presented algorithms to find subnetworks that already achieve good performance, without any training.
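The train-prune-reset recipe of Frankle and Carbin described above can be sketched as follows. This is my own toy illustration (an assumption, not the paper's construction), using magnitude pruning on a single weight vector, with `w_trained` standing in for the result of training:

```python
import numpy as np

rng = np.random.default_rng(1)

w_init = rng.standard_normal(20)                  # initial weights
w_trained = w_init + rng.standard_normal(20)      # stand-in for trained weights

# Prune the smallest-magnitude trained weights.
prune_frac = 0.5
k = int(prune_frac * w_trained.size)
threshold = np.sort(np.abs(w_trained))[k]
mask = np.abs(w_trained) >= threshold             # keep largest-magnitude weights

# The "winning ticket": surviving weights reset to their INITIAL values,
# ready to be retrained in isolation.
w_ticket = mask * w_init
```

The key point of the hypothesis is that retraining `w_ticket` (rather than a freshly initialized pruned network) can match the original network's performance.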


Shallow Neural Network can Perfectly Classify an Object following Separable Probability Distribution

Min, Youngjae, Chung, Hye Won

arXiv.org Machine Learning

Guiding the design of neural networks is of great importance to save the enormous resources consumed on empirical decisions about architectural parameters. This paper constructs shallow sigmoid-type neural networks that achieve 100% accuracy in classification for datasets following a linear separability condition. The separability condition in this work is more relaxed than the widely used linear separability. Moreover, the constructed neural network guarantees perfect classification for any dataset sampled from a separable probability distribution. This generalization capability comes from the saturation of the sigmoid function, which exploits small margins near the boundaries of the intervals formed by the separable probability distribution.
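The saturation mechanism mentioned above can be seen in a tiny example. This is my own illustration (assuming 1-D data with a positive margin `m` around a hypothetical threshold `t`, not the paper's construction): scaling the pre-activation by a large gain pushes the sigmoid output numerically to 0 or 1 on either side of the margin, which is how saturation turns a margin into an exact classification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

t, m, gain = 0.0, 0.1, 1000.0                 # threshold, margin, gain (illustrative)

x_neg = np.array([-1.0, -0.5, -0.2])          # class 0: all points <= t - m
x_pos = np.array([0.2, 0.5, 1.0])             # class 1: all points >= t + m

out_neg = sigmoid(gain * (x_neg - t))         # saturates to ~0
out_pos = sigmoid(gain * (x_pos - t))         # saturates to ~1
```

With `gain * m = 100`, the outputs are within `e^{-100}` of the exact labels, so any reasonable rounding yields perfect classification on the margin-separated region.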