over-parameterization
Optimal Lottery Tickets via Subset Sum: Logarithmic Over-Parameterization is Sufficient
A recent work by Malach et al. [MYSS20] establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width $d$ and depth $l$ by pruning a random one that is a factor $O(d^4 l^2)$ wider and twice as deep. This polynomial over-parameterization requirement is at odds with recent experimental research, which achieves good approximation with networks that are only a small factor wider than the target. In this work, we close the gap and offer an exponential improvement to the over-parameterization requirement for the existence of lottery tickets. We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep. Our analysis relies heavily on connecting the pruning of random ReLU networks to random instances of the Subset Sum problem. We then show that this logarithmic over-parameterization is essentially optimal for constant-depth networks. Finally, we verify several of our theoretical insights with experiments.
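The subset-sum connection can be illustrated numerically: a single target weight is approximated by keeping (i.e., not pruning) a subset of i.i.d. random weights, and the best achievable error shrinks rapidly as more random weights are available. The sketch below is an illustrative toy, not the paper's construction; the uniform weight distribution, the brute-force search, and the specific target value are assumptions made for simplicity.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def best_subset_sum(candidates, target):
    """Brute-force search for the subset of `candidates` whose sum is closest to `target`."""
    best_err = abs(target)                      # the empty subset
    for r in range(1, len(candidates) + 1):
        for subset in itertools.combinations(candidates, r):
            best_err = min(best_err, abs(sum(subset) - target))
    return best_err

# Approximate a single target weight by pruning (keeping a subset of) n i.i.d.
# Uniform(-1, 1) weights of an over-parameterized random network.
target = 0.371
for n in (4, 8, 12, 16):
    err = best_subset_sum(rng.uniform(-1.0, 1.0, size=n), target)
    print(f"n = {n:2d} random weights -> best pruning error {err:.2e}")
```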
Improving Adaptivity via Over-Parameterization in Sequence Models
It is well known that the eigenfunctions of a kernel play a crucial role in kernel regression. Through several examples, we demonstrate that even with the same set of eigenfunctions, the order of these functions significantly impacts regression outcomes. Simplifying the model by diagonalizing the kernel, we introduce an over-parameterized gradient descent in the realm of sequence models, designed to capture the effect of different orderings of a fixed set of eigenfunctions. Our theoretical results show that the over-parameterized gradient flow can adapt to the underlying structure of the signal and significantly outperform the vanilla gradient flow method.
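The contrast between vanilla and over-parameterized gradient flow on a diagonal (sequence) model can be sketched as below. The Hadamard-type parameterization $\theta_i = u_i^2 - v_i^2$, the sparse signal, the noise level, and the step sizes are illustrative assumptions, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(1)

# Diagonal (sequence) model: observe y_i = theta_i + noise for each coordinate.
d = 200
theta_star = np.zeros(d)
theta_star[:5] = [1.2, -0.8, 0.9, -1.1, 0.7]    # sparse ground-truth coefficients
y = theta_star + 0.05 * rng.normal(size=d)

def vanilla_flow(lr=0.05, steps=300):
    theta = np.zeros(d)
    for _ in range(steps):
        theta -= lr * (theta - y)               # gradient of 0.5 * ||theta - y||^2
    return theta

def overparam_flow(lr=0.05, steps=300, alpha=1e-3):
    # Over-parameterize each coordinate as theta_i = u_i^2 - v_i^2, small initialization.
    u = np.full(d, alpha)
    v = np.full(d, alpha)
    for _ in range(steps):
        resid = u * u - v * v - y
        u, v = u - lr * 2 * resid * u, v + lr * 2 * resid * v
    return u * u - v * v

for name, est in [("vanilla", vanilla_flow()), ("over-parameterized", overparam_flow())]:
    print(f"{name:>18} gradient flow: error {np.linalg.norm(est - theta_star):.3f}")
```

With a small initialization, the multiplicative dynamics fit the few large coordinates while leaving the noise coordinates near zero, so the over-parameterized flow typically achieves a much smaller error than the vanilla flow, which simply converges to the noisy observations.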
Review for NeurIPS paper: Neural Networks Learning and Memorization with (almost) no Over-Parameterization
Weaknesses: One of my concerns is the rigorousness of the paper. A key lemma, namely Lemma 12 in the supplementary material, is only given with a proof sketch. Moreover, the proof sketch discusses only very vaguely how the authors handle general M-decent activation functions. This makes the results for the ReLU activation function particularly questionable. The significance and novelty of this paper compared with existing results are also not fully demonstrated. It is claimed in this paper that a tight analysis is given of the convergence of the NTK to its expectation.
Review for NeurIPS paper: Neural Networks Learning and Memorization with (almost) no Over-Parameterization
This paper studies optimization in the NTK regime, further improving the best prior width bounds for random data (I believe Oymak-Soltanolkotabi were the prior best). The reviewers and I were all favorable, and I look forward to seeing this paper appear, and support the authors in further investigations. There are, however, related works on the memorization ("Baum") problem that should be discussed: one is by Roman Vershynin, and I believe Sebastien Bubeck and colleagues also had a paper on it. Relatedly, this point was not sufficiently handled in the rebuttal, despite the rebuttal using less than half a page. Please consider such things in the future.
Neural Networks Learning and Memorization with (almost) no Over-Parameterization
Many results in recent years established polynomial-time learnability of various models via neural network algorithms. However, unless the model is linearly separable \cite{brutzkus2018sgd}, or the activation is a polynomial \cite{ge2019mildly}, these results require very large networks -- much larger than what is needed for the mere existence of a good predictor. In this paper we prove that SGD on depth-two neural networks can memorize samples, learn polynomials with bounded weights, and learn certain kernel spaces, with {\em near optimal} network size, sample complexity, and runtime. In particular, we show that SGD on a depth-two network with $\tilde{O}\left(\frac{m}{d}\right)$ hidden neurons (and hence $\tilde{O}(m)$ parameters) can memorize $m$ random labeled points in $\mathbb{S}^{d-1}$.
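A minimal sketch of the memorization experiment suggested by the last sentence: $m$ random labeled points on the sphere, fit by full-batch gradient descent on a depth-two ReLU network whose width is a constant factor times $m/d$. The width constant, step size, and iteration count are guesses for illustration, not the paper's prescription, and the printed memorization fraction will depend on them.

```python
import numpy as np

rng = np.random.default_rng(2)

# m random points on the unit sphere S^{d-1} with random +/-1 labels.
m, d = 200, 50
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=m)

# Depth-two ReLU network with width proportional to m/d.
k = 8 * m // d
W = rng.normal(size=(k, d)) / np.sqrt(d)          # first layer
a = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)  # second layer

lr = 1.0
for _ in range(10000):
    H = np.maximum(X @ W.T, 0.0)                  # hidden activations, shape (m, k)
    resid = (H @ a - y) / m                       # gradient of 0.5 * mean squared error
    a -= lr * H.T @ resid
    W -= lr * ((resid[:, None] * (H > 0)) * a).T @ X

pred = np.maximum(X @ W.T, 0.0) @ a
print(f"fraction of the {m} random labels memorized: {np.mean(np.sign(pred) == y):.2f}")
```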
Global Convergence of Sub-gradient Method for Robust Matrix Recovery: Small Initialization, Noisy Measurements, and Over-parameterization
In this work, we study the performance of the sub-gradient method (SubGM) on a natural nonconvex and nonsmooth formulation of low-rank matrix recovery with $\ell_1$-loss, where the goal is to recover a low-rank matrix from a limited number of measurements, a subset of which may be grossly corrupted with noise. We study a scenario where the rank of the true solution is unknown and is over-estimated instead. The over-estimation of the rank gives rise to an over-parameterized model in which there are more degrees of freedom than needed. Such over-parameterization may lead to overfitting, or adversely affect the performance of the algorithm. We prove that a simple SubGM with small initialization is agnostic to both over-parameterization and noise in the measurements. In particular, we show that small initialization nullifies the effect of over-parameterization on the performance of SubGM, leading to an exponential improvement in its convergence rate. Moreover, we provide the first unifying framework for analyzing the behavior of SubGM under both outlier and Gaussian noise models, showing that SubGM converges to the true solution, even under arbitrarily large and arbitrarily dense noise values, and--perhaps surprisingly--even if the globally optimal solutions do not correspond to the ground truth. At the core of our results is a robust variant of the restricted isometry property, called Sign-RIP, which controls the deviation of the sub-differential of the $\ell_1$-loss from that of an ideal, expected loss. As a byproduct of our results, we consider a subclass of robust low-rank matrix recovery problems with Gaussian measurements, and show that the number of samples required to guarantee the global convergence of SubGM is independent of the over-parameterized rank.
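The setting can be sketched numerically as below: a symmetric PSD variant with a Burer-Monteiro factorization, Gaussian measurements, sparse gross outliers, an over-estimated rank, and a simple decaying step size. All of these choices are illustrative assumptions rather than the paper's exact formulation or step-size schedule.

```python
import numpy as np

rng = np.random.default_rng(3)

# Ground truth: an n x n rank-1 PSD matrix; the rank is over-estimated as r = 3.
n, r_true, r = 20, 1, 3
U_star = rng.normal(size=(n, r_true))
M_star = U_star @ U_star.T

# Gaussian measurements y_k = <A_k, M_star>, a fraction of which are grossly corrupted.
m = 30 * n
A = rng.normal(size=(m, n, n))
y = np.einsum("kij,ij->k", A, M_star)
outliers = rng.random(m) < 0.2
y[outliers] += 50.0 * rng.normal(size=outliers.sum())

# SubGM on the l1-loss f(U) = (1/m) sum_k |<A_k, U U^T> - y_k|,
# with an over-parameterized factor U (n x r) and small initialization.
U = 1e-3 * rng.normal(size=(n, r))
for t in range(2000):
    resid = np.einsum("kij,ij->k", A, U @ U.T) - y
    G = np.einsum("k,kij->ij", np.sign(resid), A) / m   # sign-averaged measurement matrices
    step = 0.02 * 0.998 ** t   # decaying step, as is common for sub-gradient methods
    U -= step * (G + G.T) @ U  # sub-gradient of f at U

err = np.linalg.norm(U @ U.T - M_star) / np.linalg.norm(M_star)
print(f"relative recovery error with over-estimated rank {r}: {err:.3f}")
```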
The Benefits of Over-parameterization at Initialization in Deep ReLU Networks
Arpit, Devansh, Bengio, Yoshua
It has been noted in the existing literature that over-parameterization in ReLU networks generally leads to better performance. While there could be several reasons for this, we investigate desirable network properties at initialization that may be enjoyed by ReLU networks. Without making any assumption, we derive a lower bound on the layer width of deep ReLU networks whose weights are initialized from a certain distribution, such that with high probability, i) the norms of the hidden activations of all layers are roughly equal to the norm of the input, and ii) the norms of the parameter gradients of all layers are roughly the same. In this way, sufficiently wide deep ReLU nets with appropriate initialization can inherently preserve the forward flow of information and also avoid the exploding/vanishing gradient problem. We further show that these results hold for an infinite number of data samples, in which case the finite lower bound depends on the input dimensionality and the depth of the network. In the case of deep ReLU networks with weight vectors normalized by their norm, we derive an initialization required to tap the aforementioned benefits of over-parameterization, without which the network fails to learn for large depths.
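Property (i) can be checked numerically with a He-style initialization (weight variance $2/\text{fan-in}$), which is one distribution consistent with the abstract's description; the widths, depth, and input dimension below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def relu_activation_norms(widths, x):
    """Forward-propagate x through random ReLU layers and record each layer's activation norm."""
    h, norms = x, [np.linalg.norm(x)]
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        # He-style initialization: variance 2/fan_in compensates for ReLU zeroing half the units.
        W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
        h = np.maximum(W @ h, 0.0)
        norms.append(np.linalg.norm(h))
    return norms

d, depth = 1000, 20
x = rng.normal(size=d)
x /= np.linalg.norm(x)                     # unit-norm input

for width in (10, 1000):
    norms = relu_activation_norms([d] + [width] * depth, x)
    print(f"width {width:4d}: activation norms over {depth} layers range from "
          f"{min(norms):.2f} to {max(norms):.2f}")
```

Narrow layers typically let the activation norm drift away from the input norm over depth, while sufficiently wide layers keep it close to 1, in line with the abstract's claim that the benefit requires enough width.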