relu
A Proof of Proposition 2.5
Proposition 2.5 is a direct consequence of the following lemma (remember that Lemma A.1 (Smooth functions conserved through a given flow). Assume that @h () ()=0 for all 2. Let us first show the direct inclusion. Now let us show the converse inclusion. We recall (cf. Example 2.10 and Example 2.11) that linear and Assumption 2.9, which we recall reads as: Theorem 2.14, let us show that (9) holds for standard ML losses.
Appendix A Proof of Proposition 1 in Section 2
ReLU(T(v u) + b) = ReLU(Tv + b), where u ≠ 0, that is, ReLU(T + b) is not injective. By injectivity of T, we finally get a = b. Remark 2. An example that satisfies (3.1) is the neural operator whose This construction is given by the combination of "Pairs of projections" discussed in Kato [2013, Section I.4.6] with the idea presented in [Puthawala et al., 2022b, Lemma 29]. R. We write operator null G by Thus, in both cases, H is injective. Remark 4. We make the following observations using Theorem 1: Leaky ReLU is one example that satisfies (ii) in Theorem 1. Puthawala et al. [2022a, Theorem 15] assume that We first revisit layerwise injectivity and bijectivity in the case of the finite rank approximation.
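The non-injectivity of a ReLU layer noted in this excerpt can be seen in a small finite-dimensional toy case. The sketch below is only an illustration and is not the operator-level construction of the paper: the matrix `T`, the bias `b`, the two inputs and the leaky slope of 0.1 are all assumed values chosen for the example. It shows an injective linear map (the identity) whose composition with ReLU sends two different inputs to the same output, while leaky ReLU keeps them apart.

```python
def relu(x):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, xi) for xi in x]


def leaky_relu(x, slope=0.1):
    """Elementwise leaky ReLU; the slope 0.1 is an assumed illustrative value."""
    return [xi if xi > 0 else slope * xi for xi in x]


def affine(T, v, b):
    """Compute T v + b for a matrix T given as a list of rows."""
    return [sum(t * vj for t, vj in zip(row, v)) + bi for row, bi in zip(T, b)]


# Hypothetical toy data: T is the identity, so T itself is injective.
T = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
v1 = [-1.0, -2.0]
v2 = [-3.0, -4.0]

# Both pre-activations are negative in every coordinate, so ReLU collapses them.
relu_out_1 = relu(affine(T, v1, b))
relu_out_2 = relu(affine(T, v2, b))

# Leaky ReLU is strictly increasing, so distinct inputs stay distinct.
leaky_out_1 = leaky_relu(affine(T, v1, b))
leaky_out_2 = leaky_relu(affine(T, v2, b))

print(relu_out_1 == relu_out_2)    # the two ReLU outputs coincide
print(leaky_out_1 == leaky_out_2)  # the two leaky ReLU outputs differ
```

The collapse happens because ReLU discards all information in coordinates where the pre-activation is negative; a strictly monotone activation such as leaky ReLU does not.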
A Detailed comparisons with related work
In Table 1, we compare our agnostic learning results. Our results in this setting come from Theorem 3.3. We note that the sample complexity for Diakonikolas et al. To prove Lemma 3.5, we use the following result of Yehudai and Shamir [35]. We first consider the case when σ satisfies Assumption 3.1.
Efficient and Minimax-optimal In-context Nonparametric Regression with Transformers
Ching, Michelle, Popescu, Ioana, Smith, Nico, Ma, Tianyi, Underwood, William G., Samworth, Richard J.
We study in-context learning for nonparametric regression with $α$-Hölder smooth regression functions, for some $α>0$. We prove that, with $n$ in-context examples and $d$-dimensional regression covariates, a pretrained transformer with $Θ(\log n)$ parameters and $Ω\bigl(n^{2α/(2α+d)}\log^3 n\bigr)$ pretraining sequences can achieve the minimax-optimal rate of convergence $O\bigl(n^{-2α/(2α+d)}\bigr)$ in mean squared error. Our result requires substantially fewer transformer parameters and pretraining sequences than previous results in the literature. This is achieved by showing that transformers are able to approximate local polynomial estimators efficiently by implementing a kernel-weighted polynomial basis and then running gradient descent.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.60)
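The abstract above describes approximating local polynomial estimators by building a kernel-weighted polynomial basis and then running gradient descent. A minimal sketch of that underlying estimator (not of the transformer construction itself) might look as follows. It assumes one-dimensional covariates, a degree-one (local linear) fit, an Epanechnikov-type kernel, and hypothetical values for the bandwidth, step size and iteration count; none of these choices come from the paper.

```python
def epanechnikov(u):
    """Unnormalised Epanechnikov-type kernel, supported on [-1, 1]."""
    return max(0.0, 1.0 - u * u)


def local_linear_gd(xs, ys, x0, bandwidth, step=1.0, iters=200):
    """Local linear estimate of the regression function at x0.

    Minimises the kernel-weighted least squares objective
        0.5 * sum_i w_i * (b0 + b1 * u_i - y_i) ** 2,  u_i = (x_i - x0) / bandwidth,
    by plain gradient descent and returns the fitted intercept b0.
    """
    us = [(x - x0) / bandwidth for x in xs]
    raw = [epanechnikov(u) for u in us]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no sample points fall inside the kernel window")
    ws = [r / total for r in raw]  # normalised kernel weights

    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for w, u, y in zip(ws, us, ys):
            resid = b0 + b1 * u - y
            g0 += w * resid        # gradient with respect to b0
            g1 += w * resid * u    # gradient with respect to b1
        b0 -= step * g0
        b1 -= step * g1
    return b0


# Hypothetical noise-free data from f(x) = 2x + 1 on an even grid in [0, 1].
xs = [i / 20 for i in range(21)]
ys = [2.0 * x + 1.0 for x in xs]
estimate = local_linear_gd(xs, ys, x0=0.5, bandwidth=0.3)
print(estimate)  # close to f(0.5) = 2.0
```

Rescaling the covariate by the bandwidth and normalising the weights keeps the weighted Gram matrix well conditioned, which is why a fixed step size of 1.0 is enough for convergence in this toy setting.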
Invertibility of Convolutional Generative Networks from Partial Measurements
In this work, we present new theoretical results on convolutional generative neural networks, in particular their invertibility (i.e., the recovery of the input latent code given the network output). The study of the network inversion problem is motivated by image inpainting and the mode collapse problem in training GANs. Network inversion is highly non-convex, and thus is typically computationally intractable and without optimality guarantees. However, we rigorously prove that, under some mild technical assumptions, the input of a two-layer convolutional generative network can be deduced from the network output efficiently using simple gradient descent. This new theoretical finding implies that the mapping from the low-dimensional latent space to the high-dimensional image space is bijective (i.e., one-to-one). In addition, the same conclusion holds even when the network output is only partially observed (i.e., with missing pixels). Our theorems hold for two-layer convolutional generative networks with ReLU as the activation function, but we demonstrate empirically that the same conclusion extends to multi-layer networks and networks with other activation functions, including the leaky ReLU, sigmoid and tanh.
Tight Sample Complexity of Learning One-hidden-layer Convolutional Neural Networks
We study the sample complexity of learning one-hidden-layer convolutional neural networks (CNNs) with non-overlapping filters. We propose a novel algorithm called approximate gradient descent for training CNNs, and show that, with high probability, the proposed algorithm with random initialization achieves linear convergence to the ground-truth parameters up to statistical precision. Compared with existing work, our result applies to general non-trivial, monotonic and Lipschitz continuous activation functions, including ReLU, Leaky ReLU, Sigmoid and Softplus. Moreover, our sample complexity improves on existing results in its dependency on the number of hidden nodes and the filter size. In fact, our result matches the information-theoretic lower bound for learning one-hidden-layer CNNs with linear activation functions, suggesting that our sample complexity is tight. Our theoretical analysis is backed up by numerical experiments.
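To make the model class in this abstract concrete, here is a minimal sketch of the forward pass of a one-hidden-layer CNN with non-overlapping filters. It assumes the common parameterisation in which a single shared filter `w` is applied to disjoint patches of the input and the activations are combined with output weights `v`; the function name `cnn_forward` and all numeric values are hypothetical. The approximate gradient descent algorithm of the paper is not reproduced here.

```python
def relu(t):
    """Scalar ReLU activation."""
    return max(0.0, t)


def cnn_forward(x, w, v, activation=relu):
    """Output of a one-hidden-layer CNN with non-overlapping patches.

    x: input of length r * k, split into k disjoint patches of length r
    w: shared filter of length r
    v: output weights of length k (one per hidden node)
    """
    r, k = len(w), len(v)
    if len(x) != r * k:
        raise ValueError("input length must equal filter size times number of hidden nodes")
    patches = [x[j * r:(j + 1) * r] for j in range(k)]
    hidden = [activation(sum(wi * xi for wi, xi in zip(w, p))) for p in patches]
    return sum(vj * hj for vj, hj in zip(v, hidden))


# Hypothetical example: filter size r = 2, k = 3 hidden nodes.
x = [1.0, 2.0, -1.0, 0.5, 3.0, -4.0]
w = [2.0, 1.0]
v = [0.5, 1.0, -2.0]
output = cnn_forward(x, w, v)
print(output)  # hidden activations are 4.0, 0.0, 2.0, so the output is -2.0
```

Because the patches do not overlap, each input coordinate influences exactly one hidden node, which is the structural assumption that the abstract's sample complexity analysis relies on.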