A Proof of Proposition 2.5

Neural Information Processing Systems

Proposition 2.5 is a direct consequence of the following lemma (remember that … Lemma A.1 (Smooth functions conserved through a given flow). Assume that ∂h(θ) χ(θ) = 0 for all θ ∈ Ω. … Let us first show the direct inclusion. … Now let us show the converse inclusion. … We recall (cf. Example 2.10 and Example 2.11) that linear and … Assumption 2.9, which we recall reads as: … Theorem 2.14, let us show that (9) holds for standard ML losses.



Appendix A Proof of Proposition 1 in Section 2

Neural Information Processing Systems

ReLU(T(v − u) + b) = ReLU(Tv + b), where u ≠ 0, that is, ReLU(T + b) is not injective. … By injectivity of T, we finally get a = b. Remark 2. An example that satisfies (3.1) is the neural operator whose … This construction is given by the combination of "Pairs of projections" discussed in Kato [2013, Section I.4.6] with the idea presented in [Puthawala et al., 2022b, Lemma 29]. … We write operator null G by … Thus, in both cases, H is injective. Remark 4. We make the following observations using Theorem 1: Leaky ReLU is one example that satisfies (ii) in Theorem 1. Puthawala et al. [2022a, Theorem 15] assumes that … We first revisit layerwise injectivity and bijectivity in the case of the finite rank approximation.
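The contrast drawn in this snippet between ReLU and Leaky ReLU can be illustrated with a small finite-dimensional sketch. Everything below is an assumption made for illustration and is not taken from the paper: the map `T` (the identity, which is injective), the bias `b`, the two inputs, and the slope `0.5`. The sketch only shows the elementary pointwise fact that ReLU can collapse distinct inputs even when `T` is injective, while Leaky ReLU has an explicit inverse; it does not reproduce the operator-level conditions of Theorem 1.

```python
def affine(T, v, b):
    """Compute T v + b for a matrix T given as a list of rows."""
    return [sum(t * x for t, x in zip(row, v)) + bias for row, bias in zip(T, b)]


def relu(x):
    return [max(0.0, t) for t in x]


def leaky_relu(x, slope=0.5):
    # slope is a hypothetical choice; any slope in (0, 1) keeps the map invertible
    return [t if t >= 0 else slope * t for t in x]


def leaky_relu_inverse(y, slope=0.5):
    # Negative outputs can only come from negative inputs, so undo the scaling.
    return [t if t >= 0 else t / slope for t in y]


# Hypothetical data: T is the identity (injective), b shifts everything down.
T = [[1.0, 0.0], [0.0, 1.0]]
b = [-1.0, -1.0]
v1 = [0.0, 0.5]
v2 = [-2.0, 0.25]

pre1 = affine(T, v1, b)  # [-1.0, -0.5]
pre2 = affine(T, v2, b)  # [-3.0, -0.75]

relu_collides = relu(pre1) == relu(pre2)              # both inputs map to [0.0, 0.0]
leaky_separates = leaky_relu(pre1) != leaky_relu(pre2)
recovered = leaky_relu_inverse(leaky_relu(pre1))      # equals pre1 again
```

Because `T` is injective here, recovering the pre-activation from the Leaky ReLU output is enough to recover the input itself.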





A Detailed comparisons with related work

Neural Information Processing Systems

In Table 1, we compare our agnostic learning results. Our results in this setting come from Theorem 3.3. We note that the sample complexity for Diakonikolas et al. … To prove Lemma 3.5, we use the following result of Yehudai and Shamir [35]. We first consider the case when σ satisfies Assumption 3.1.


Efficient and Minimax-optimal In-context Nonparametric Regression with Transformers

Ching, Michelle, Popescu, Ioana, Smith, Nico, Ma, Tianyi, Underwood, William G., Samworth, Richard J.

arXiv.org Machine Learning

We study in-context learning for nonparametric regression with $α$-Hölder smooth regression functions, for some $α>0$. We prove that, with $n$ in-context examples and $d$-dimensional regression covariates, a pretrained transformer with $Θ(\log n)$ parameters and $Ω\bigl(n^{2α/(2α+d)}\log^3 n\bigr)$ pretraining sequences can achieve the minimax-optimal rate of convergence $O\bigl(n^{-2α/(2α+d)}\bigr)$ in mean squared error. Our result requires substantially fewer transformer parameters and pretraining sequences than previous results in the literature. This is achieved by showing that transformers are able to approximate local polynomial estimators efficiently by implementing a kernel-weighted polynomial basis and then running gradient descent.
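The estimator that the abstract says transformers approximate, a local polynomial fit obtained by gradient descent on a kernel-weighted basis, can be sketched in its simplest form as follows. This is a minimal illustration under stated assumptions, not the paper's construction: it uses a degree-1 (local linear) fit in dimension d = 1, an Epanechnikov-type kernel, and hypothetical names and hyperparameters (`local_linear_estimate`, `bandwidth`, `steps`, `lr`).

```python
def epanechnikov(u):
    """Compactly supported kernel weight; zero outside [-1, 1]."""
    return max(0.0, 1.0 - u * u)


def local_linear_estimate(xs, ys, x0, bandwidth, steps=500, lr=0.25):
    """Estimate the regression function at x0 by gradient descent on a
    kernel-weighted least-squares objective over the basis (1, (x - x0) / h)."""
    zs = [(x - x0) / bandwidth for x in xs]   # rescaled offsets from x0
    ws = [epanechnikov(z) for z in zs]        # kernel weights
    total = sum(ws)
    if total == 0.0:
        raise ValueError("no sample falls inside the kernel window")
    ws = [w / total for w in ws]              # normalise weights to sum to 1

    intercept, slope = 0.0, 0.0
    for _ in range(steps):
        grad_intercept = 0.0
        grad_slope = 0.0
        for w, z, y in zip(ws, zs, ys):
            residual = intercept + slope * z - y
            grad_intercept += 2.0 * w * residual
            grad_slope += 2.0 * w * residual * z
        intercept -= lr * grad_intercept
        slope -= lr * grad_slope
    # The fitted intercept is the estimate of the regression function at x0.
    return intercept


# Hypothetical noiseless data from the linear function f(x) = 2x + 1.
xs = [i / 20 for i in range(21)]
ys = [2.0 * x + 1.0 for x in xs]
estimate = local_linear_estimate(xs, ys, x0=0.5, bandwidth=0.3)
```

With normalised weights and offsets rescaled to [-1, 1], the largest eigenvalue of the objective's Hessian is at most 4, so a step size of 0.25 keeps plain gradient descent stable in this sketch.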


Invertibility of Convolutional Generative Networks from Partial Measurements

Neural Information Processing Systems

In this work, we present new theoretical results on convolutional generative neural networks, in particular their invertibility (i.e., the recovery of the input latent code given the network output). The study of the network inversion problem is motivated by image inpainting and the mode collapse problem in training GANs. Network inversion is highly non-convex, and thus is typically computationally intractable and without optimality guarantees. However, we rigorously prove that, under some mild technical assumptions, the input of a two-layer convolutional generative network can be deduced from the network output efficiently using simple gradient descent. This new theoretical finding implies that the mapping from the low-dimensional latent space to the high-dimensional image space is bijective (i.e., one-to-one). In addition, the same conclusion holds even when the network output is only partially observed (i.e., with missing pixels). Our theorems hold for two-layer convolutional generative networks with ReLU as the activation function, but we demonstrate empirically that the same conclusion extends to multi-layer networks and networks with other activation functions, including the leaky ReLU, sigmoid and tanh.


Tight Sample Complexity of Learning One-hidden-layer Convolutional Neural Networks

Neural Information Processing Systems

We study the sample complexity of learning one-hidden-layer convolutional neural networks (CNNs) with non-overlapping filters. We propose a novel algorithm called approximate gradient descent for training CNNs, and show that, with high probability, the proposed algorithm with random initialization achieves linear convergence to the ground-truth parameters up to statistical precision. Compared with existing work, our result applies to general non-trivial, monotonic and Lipschitz continuous activation functions, including ReLU, Leaky ReLU, Sigmoid and Softplus. Moreover, our sample complexity beats existing results in its dependence on the number of hidden nodes and the filter size. In fact, our result matches the information-theoretic lower bound for learning one-hidden-layer CNNs with linear activation functions, suggesting that our sample complexity is tight. Our theoretical analysis is backed up by numerical experiments.
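For readers unfamiliar with the model class, a one-hidden-layer CNN with non-overlapping filters can be sketched as a forward pass. The abstract does not spell out the parameterisation, so the form used here is an assumption: one shared filter applied to disjoint patches of the input, followed by a linear output layer. The names `one_hidden_layer_cnn`, `filter_weights`, and `output_weights` are hypothetical, and the sketch does not implement the approximate gradient descent algorithm itself.

```python
def relu(t):
    return max(0.0, t)


def one_hidden_layer_cnn(x, filter_weights, output_weights, activation=relu):
    """Forward pass: split x into disjoint patches, apply one shared filter
    and the activation to each patch, then combine with output weights."""
    k = len(filter_weights)
    if len(x) % k != 0:
        raise ValueError("input length must be a multiple of the filter size")
    # Non-overlapping patches: the stride equals the filter size.
    patches = [x[i:i + k] for i in range(0, len(x), k)]
    if len(patches) != len(output_weights):
        raise ValueError("need one output weight per hidden node (patch)")
    hidden = [
        activation(sum(w * p for w, p in zip(filter_weights, patch)))
        for patch in patches
    ]
    return sum(v * h for v, h in zip(output_weights, hidden))


# Hypothetical input with two patches, [1, 2] and [3, -4], and filter [1, 1].
output = one_hidden_layer_cnn([1.0, 2.0, 3.0, -4.0], [1.0, 1.0], [0.5, 0.5])
```

In this example the first patch gives a pre-activation of 3 and the second gives -1, which ReLU maps to 0, so the output is 0.5 * 3 = 1.5. Passing a different `activation` gives other monotonic, Lipschitz choices such as Leaky ReLU.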