Reviews: Generalization Bounds for Neural Networks via Approximate Description Length
In this paper the authors establish upper bounds on the generalization error of classes of norm-bounded neural networks. There is a long line of literature on this exact question, and this paper claims to resolve an interesting open question in the area (at least when the depth of the network is viewed as a constant). In particular, the paper considers generalization bounds for a class of fully-connected networks of constant depth whose weight matrices have bounded norm. Work by Bartlett et al. ("Spectrally-normalized margin bounds for neural networks", ref [4] in the paper) proved an upper bound on generalization error that contains a factor growing as the (1,2)-matrix norm of each layer. If one further assumes that the depth as well as all the spectral norms are constants, then this factor is the dominant term (up to logarithmic factors) in their generalization bound.
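To make the quantitative gap between the two results concrete, here is a minimal Python sketch (ours, not from the paper or the review) that evaluates the dominant terms of the prior $\tilde O(d^2R^2/\epsilon^2)$ rate and the paper's $\tilde O(dR^2/\epsilon^2)$ rate side by side, with depth, spectral norms, log factors, and constants all treated as $O(1)$ as the review suggests; the function names and the specific values of d, R, and eps are illustrative assumptions.

```python
# Dominant terms of the two sample-complexity bounds, with depth, spectral
# norms, log factors, and constants all treated as O(1).

def samples_prior(d, R, eps):
    # Previous state of the art: ~ d^2 R^2 / eps^2
    return d**2 * R**2 / eps**2

def samples_this_paper(d, R, eps):
    # Bound established in the reviewed paper: ~ d R^2 / eps^2
    return d * R**2 / eps**2

# Illustrative values: input dimension d, Frobenius-norm bound R, accuracy eps.
for d in (10**2, 10**4, 10**6):
    old = samples_prior(d, R=10.0, eps=0.1)
    new = samples_this_paper(d, R=10.0, eps=0.1)
    print(f"d={d:>9,}: prior ~{old:.1e} examples, new ~{new:.1e} (ratio = d = {old/new:.0e})")
```

The entire difference between the two rates is the single factor of d visible in the ratio; everything else in this comparison is held constant.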
Reviews: Generalization Bounds for Neural Networks via Approximate Description Length
This paper proposes a new framework for bounding the generalization error of fully connected neural nets. The authors are able to show that, for sufficiently smooth activation functions, the number of examples required to achieve a good generalization error scales sublinearly with the total number of parameters in the network. This is a significantly better bound than the previous state-of-the-art results. The analytical tools based on description length are very interesting, and could be applicable to the analysis of other multi-layer non-convex models. All three reviewers are uniformly enthusiastic about this work, which is guaranteed to attract a great deal of attention and to catalyze further research activity.
Approximate Description Length, Covering Numbers, and VC Dimension
Amit Daniely, Gal Katzhendler
Neural networks are a widely used tool nowadays, despite the lack of theoretical background supporting their ability to generalize well. Classical notions of learning guarantee generalization only if there are more examples than parameters. It is clear that a stronger assumption is needed to achieve tighter bounds, and indeed, different types of assumptions have been used to fill this empirical-theoretical gap, including assumptions on robustness to noise [2], the bias of the learning algorithm [5, 10], and norm bounds on the weight matrices [8, 9]. The idea of Approximate Description Length [4] was conceived as part of the line of research working under assumptions that bound the magnitude of the network's weight matrices.
Generalization Bounds for Neural Networks via Approximate Description Length
We investigate the sample complexity of networks with bounds on the magnitudes of their weights. In particular, we consider the class \[ H=\left\{W_t\circ\rho\circ \ldots\circ\rho\circ W_{1} : W_1,\ldots,W_{t-1}\in M_{d, d}, W_t\in M_{1,d}\right\} \] where the spectral norm of each $W_i$ is bounded by $O(1)$, the Frobenius norm is bounded by $R$, and $\rho$ is the sigmoid function $\frac{e^x}{1+e^x}$ or the smoothened ReLU function $\ln(1+e^x)$. We show that for any depth $t$, if the inputs are in $[-1,1]^d$, the sample complexity of $H$ is $\tilde O\left(\frac{dR^2}{\epsilon^2}\right)$. This bound is optimal up to log-factors, and substantially improves over the previous state of the art of $\tilde O\left(\frac{d^2R^2}{\epsilon^2}\right)$. We furthermore show that this bound remains valid if, instead of considering the magnitude of the $W_i$'s, we consider the magnitude of $W_i - W_i^0$, where the $W_i^0$ are some reference matrices with spectral norm $O(1)$. By taking the $W_i^0$ to be the matrices at the onset of the training process, we get sample complexity bounds that are sub-linear in the number of parameters in many typical regimes of parameters. To establish our results we develop a new technique to analyze the sample complexity of families $H$ of predictors. We start by defining a new notion of a randomized approximate description of functions $f:X\to\mathbb{R}^d$. We then show that if there is a way to approximately describe functions in a class $H$ using $d$ bits, then $d/\epsilon^2$ examples suffice to guarantee uniform convergence, namely, that the empirical loss of all the functions in the class is $\epsilon$-close to the true loss. Finally, we develop a set of tools for calculating the approximate description length of classes of functions that can be presented as a composition of linear function classes and non-linear functions.
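To ground the definitions, here is a minimal, self-contained Python sketch (ours, not the paper's code; every name in it is an illustrative assumption, and the spectral bound of 1.0 stands in for the $O(1)$ constraint) that builds a member of the class $H$ with the smoothened ReLU activation $\rho(x)=\ln(1+e^x)$, checks the spectral and Frobenius norm constraints that define membership, and evaluates the $\tilde O(dR^2/\epsilon^2)$ rate with log factors and constants dropped.

```python
import numpy as np

def smoothened_relu(x):
    # rho(x) = ln(1 + e^x), one of the two activations allowed in H.
    return np.logaddexp(0.0, x)

def make_network(weights, rho=smoothened_relu):
    # weights = [W_1, ..., W_t]: W_1..W_{t-1} are d x d, W_t is 1 x d.
    # Returns f(x) = W_t . rho( ... rho(W_1 x) ... ), as in the definition of H.
    def f(x):
        h = np.asarray(x, dtype=float)
        for W in weights[:-1]:
            h = rho(W @ h)
        return weights[-1] @ h
    return f

def in_class_H(weights, R, spectral_bound=1.0):
    # Membership test: spectral norm of each layer O(1) (here: <= spectral_bound)
    # and Frobenius norm <= R.
    return all(np.linalg.norm(W, ord=2) <= spectral_bound
               and np.linalg.norm(W, ord='fro') <= R
               for W in weights)

def sample_complexity(d, R, eps):
    # The paper's rate, up to log factors and constants: ~ d R^2 / eps^2.
    return d * R**2 / eps**2

# Toy usage: a depth-3 network on inputs in [-1, 1]^d.
d, t, R = 64, 3, 4.0
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((d, d)) * 0.05 for _ in range(t - 1)]
Ws.append(rng.standard_normal((1, d)) * 0.05)
f = make_network(Ws)
x = rng.uniform(-1.0, 1.0, size=d)
print("f(x) =", f(x), "| in H:", in_class_H(Ws, R),
      "| m ~", sample_complexity(d, R, eps=0.1))
```

The randomized compression scheme that actually certifies a short approximate description of such a network is the technical core of the paper and is not reproduced in this toy sketch.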