This paper proves an abstract theorem addressing in a unified manner two important problems in function approximation: avoiding curse of dimensionality and estimating the degree of approximation for out-of-sample extension in manifold learning. We consider an abstract (shallow) network that includes, for example, neural networks, radial basis function networks, and kernels on data defined manifolds used for function approximation in various settings. A deep network is obtained by a composition of the shallow networks according to a directed acyclic graph, representing the architecture of the deep network. In this paper, we prove dimension independent bounds for approximation by shallow networks in the very general setting of what we have called $G$-networks on a compact metric measure space, where the notion of dimension is defined in terms of the cardinality of maximal distinguishable sets, generalizing the notion of dimension of a cube or a manifold. Our techniques give bounds that improve without saturation with the smoothness of the kernel involved in an integral representation of the target function. In the context of manifold learning, our bounds provide estimates on the degree of approximation for an out-of-sample extension of the target function to the ambient space. One consequence of our theorem is that without the requirement of robust parameter selection, deep networks using a non-smooth activation function such as the ReLU, do not provide any significant advantage over shallow networks in terms of the degree of approximation alone.

We show that deep networks are better than shallow networks at approximating functions that can be expressed as a composition of functions described by a directed acyclic graph, because the deep networks can be designed to have the same compositional structure, while a shallow network cannot exploit this knowledge. Thus, the blessing of compositionality mitigates the curse of dimensionality. On the other hand, a theorem called good propagation of errors allows to `lift' theorems about shallow networks to those about deep networks with an appropriate choice of norms, smoothness, etc. We illustrate this in three contexts where each channel in the deep network calculates a spherical polynomial, a non-smooth ReLU network, or another zonal function network related closely with the ReLU network.

Mhaskar, H. N., Cloninger, A., Cheng, X.

In machine learning, we are given a dataset of the form $\{(\mathbf{x}_j,y_j)\}_{j=1}^M$, drawn as i.i.d. samples from an unknown probability distribution $\mu$; the marginal distribution for the $\mathbf{x}_j$'s being $\mu^*$. We propose that rather than using a positive kernel such as the Gaussian for estimation of these measures, using a non-positive kernel that preserves a large number of moments of these measures yields an optimal approximation. We use multi-variate Hermite polynomials for this purpose, and prove optimal and local approximation results in a supremum norm in a probabilistic sense. Together with a permutation test developed with the same kernel, we prove that the kernel estimator serves as a `witness function' in classification problems. Thus, if the value of this estimator at a point $\mathbf{x}$ exceeds a certain threshold, then the point is reliably in a certain class. This approach can be used to modify pretrained algorithms, such as neural networks or nonlinear dimension reduction techniques, to identify in-class vs out-of-class regions for the purposes of generative models, classification uncertainty, or finding robust centroids. This fact is demonstrated in a number of real world data sets including MNIST, CIFAR10, Science News documents, and LaLonde data sets.

The problem of super-resolution in general terms is to recuperate a finitely supported measure $\mu$ given finitely many of its coefficients $\hat{\mu}(k)$ with respect to some orthonormal system. The interesting case concerns situations, where the number of coefficients required is substantially smaller than a power of the reciprocal of the minimal separation among the points in the support of $\mu$. In this paper, we consider the more severe problem of recuperating $\mu$ approximately without any assumption on $\mu$ beyond having a finite total variation. In particular, $\mu$ may be supported on a continuum, so that the minimal separation among the points in the support of $\mu$ is $0$. A variant of this problem is also of interest in machine learning as well as the inverse problem of de-convolution. We define an appropriate notion of a distance between the target measure and its recuperated version, give an explicit expression for the recuperation operator, and estimate the distance between $\mu$ and its approximation. We show that these estimates are the best possible in many different ways. We also explain why for a finitely supported measure the approximation quality of its recuperation is bounded from below if the amount of information is smaller than what is demanded in the super-resolution problem.

Dealing with massive data is a challenging task for machine learning. An important aspect of machine learning is function approximation. In the context of massive data, some of the commonly used tools for this purpose are sparsity, divide-and-conquer, and distributed learning. In this paper, we develop a very general theory of approximation by networks, which we have called eignets, to achieve local, stratified approximation. The very massive nature of the data allows us to use these eignets to solve inverse problems such as finding a good approximation to the probability law that governs the data, and finding the local smoothness of the target function near different points in the domain. In fact, we develop a wavelet-like representation using our eignets. Our theory is applicable to approximation on a general locally compact metric measure space. Special examples include approximation by periodic basis functions on the torus, zonal function networks on a Euclidean sphere (including smooth ReLU networks), Gaussian networks, and approximation on manifolds. We construct pre-fabricated networks so that no data-based training is required for the approximation.