 sgn


ec4f0b0a7557d6a51c42308800f2c23a-Supplemental-Conference.pdf

Neural Information Processing Systems

Let (x, y) be a binary classification task that admits a smooth separator as in Assumption 1. Then, there exists an RLC with neural network f_θ and absolutely continuous randomness source u (Assumption 2) that is universal in the limit, i.e., F_θ(x) = y(x), ∀x ∈ X, and makes random predictions that are correct with probability P(maj({sgn( a … Further, if p is the number of parameters used by a deterministic neural network with one hidden layer to achieve zero error in the task, f_θ has at most p p + O(1) parameters. Since Assumption 1 holds, there exists a single-hidden-layer neural network N that, like s, achieves zero error in this task [8]. Further, since sgn is nonpolynomial, we can use it as the nonlinearity of this network [21]. Putting it all together, there exist a number of hidden units M and parameters b_j, o_j ∈ ℝ, w_j ∈ ℝ^d for j = 1, ..., M such that N(x) := … Note that this means we can achieve zero error in classification: N(x) = y(x), ∀x ∈ X.
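As a rough illustration of the majority-vote prediction mentioned in the excerpt, the sketch below draws m random linear classifiers and returns the majority of their signs. It is not the paper's construction: `sample_coefficients` is a hypothetical stand-in for the network f_θ applied to noise u, the direction `w_true` is a toy assumption, and sgn(0) is arbitrarily mapped to -1.

```python
import numpy as np

def sample_coefficients(rng, d):
    """Hypothetical stand-in for f_theta(u): map noise u to coefficients (a, b)."""
    u = rng.standard_normal(d + 1)   # Gaussian noise as one absolutely continuous source
    w_true = np.ones(d)              # toy separating direction (assumption, not from the source)
    a = w_true + 0.5 * u[:d]         # randomly perturbed linear coefficients
    b = 0.5 * u[d]                   # randomly perturbed offset
    return a, b

def rlc_predict(x, m, rng):
    """Majority vote over m independently sampled linear classifiers."""
    votes = 0
    for _ in range(m):
        a, b = sample_coefficients(rng, x.shape[0])
        votes += 1 if a @ x - b > 0 else -1   # sgn, with sgn(0) treated as -1
    return 1 if votes > 0 else -1             # m is odd below, so no ties occur

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 0.5])
pred_pos = rlc_predict(x, m=101, rng=rng)
pred_neg = rlc_predict(-x, m=101, rng=rng)
print(pred_pos, pred_neg)
```

Each individual vote is noisy, but because every sampled classifier agrees with the toy separator more often than not, the majority becomes reliable as m grows. Choosing m odd avoids ties.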


Faster Directional Convergence of Linear Neural Networks under Spherically Symmetric Data

Neural Information Processing Systems

In this paper, we study gradient methods for training deep linear neural networks with binary cross-entropy loss. In particular, we show global directional convergence guarantees from a polynomial rate to a linear rate for (deep) linear networks with a spherically symmetric data distribution, which can be viewed as a specific zero-margin dataset. Our results do not require the assumptions in other works such as small initial loss, presumed convergence of weight direction, or overparameterization. We also characterize our findings in experiments.
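To make "directional convergence" concrete, here is a minimal sketch that trains a two-layer linear network with plain gradient descent on the logistic (binary cross-entropy) loss and tracks the cosine between its effective weight vector and a target direction. Several choices are assumptions not taken from the abstract: the isotropic Gaussian inputs (one example of a spherically symmetric distribution), the label model y = sgn(⟨w_star, x⟩), the width `h`, the step size, and the initialization scale. It illustrates the quantity being studied and does not reproduce the paper's rates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 2000, 5, 8                      # samples, input dim, hidden width (illustrative)

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)          # hypothetical target direction
X = rng.standard_normal((n, d))           # isotropic Gaussian inputs
y = np.where(X @ w_star > 0, 1.0, -1.0)   # assumed label model

W1 = 0.1 * rng.standard_normal((h, d))    # first linear layer
w2 = 0.1 * rng.standard_normal(h)         # second linear layer (scalar output)

def cosine_to_target(W1, w2):
    v = W1.T @ w2                         # effective linear predictor of the network
    return float(v @ w_star / np.linalg.norm(v))

lr, steps = 0.2, 1000
history = []
for _ in range(steps):
    v = W1.T @ w2
    z = y * (X @ v)                       # per-sample margins
    s = 0.5 * (1.0 - np.tanh(z / 2.0))    # sigmoid(-z), in a numerically stable form
    g = -(X * (y * s)[:, None]).mean(axis=0)   # gradient of mean log(1 + exp(-z)) w.r.t. v
    grad_W1 = np.outer(w2, g)             # chain rule through v = W1^T w2
    grad_w2 = W1 @ g
    W1 -= lr * grad_W1
    w2 -= lr * grad_w2
    history.append(cosine_to_target(W1, w2))

final_cosine = history[-1]
print(f"cosine to target after {steps} steps: {final_cosine:.4f}")
```

Plotting `1 - history[t]` on a log scale is one simple way to eyeball how fast the direction converges in such a toy run.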









A Proofs

Neural Information Processing Systems

Further taking the usual assumption that X is compact. … Let us start with Proposition 3, a central observation needed in Theorem 2. Put into words … Now, we can proceed to prove the universality part of Theorem 2. Since the task admits a smooth separator, … By Fubini's theorem and Proposition 3, we have F … The reader can think of λ as a uniform distribution over G. … (as in Theorem 2). The result follows directly from the combination of de Finetti's theorem […] Combining this with Kallenberg's noise transfer theorem we have that the weights … and Assumption 1 or ii) is an inner-product decision graph problem as in Definition 3. Further, the task has infinitely … (as in Theorem 2). Finally, we follow Proposition 2's proof by simply replacing de Finetti's with Aldous-Hoover's theorem. Define an RLC that samples the linear coefficients as follows.