to

### Deep Network with Approximation Error Being Reciprocal of Width to Power of Square Root of Depth

Recently, there has been a large number of successful real-world applications of deep neural networks in many fields of computer science and engineering, especially for large-scale and high-dimensional learning problems. Understanding the approximation capacity of deep neural networks has become a fundamental research direction for revealing the advantages of deep learning compared to traditional methods. This paper introduces new theories and network architectures achieving root exponential convergence and avoiding the curse of dimensionality simultaneously for (Hölder) continuous functions with an explicit error bound in deep network approximation, which might be two foundational laws supporting the application of deep network approximation in large-scale and high-dimensional problems. The approximation results here are quantitative and apply to networks with essentially arbitrary width and depth. These results suggest considering Floor-ReLU networks as a possible alternative to ReLU networks in deep learning.

### Deep Network Approximation for Smooth Functions

Deep neural networks have made significant impacts in many fields of computer science and engineering especially for large-scale and high-dimensional learning problems. Well-designed neural network architectures, efficient training algorithms, and high-performance computing technologies have made neural-network-based methods very successful in tremendous real applications. Especially in supervised learning, e.g., image classification and objective detection, the great advantages of neural-network-based methods have been demonstrated over traditional learning methods. Mathematically speaking, supervised learning is essentially a regression problem where the problem of function approximation plays a fundamental role. Understanding the approximation capacity of deep neural networks has become a key question for revealing the power of deep learning.

### Neural Network Approximation: Three Hidden Layers Are Enough

In particular, leveraging the power of advanced yet simple activation functions, we will introduce new theories and network architectures with only three hidden layers achieving exponential convergence and avoiding the curse of dimensionality simultaneously for (Hölder) continuous functions with an explicit approximation bound. The theories established in this paper would provide new insights to explain why deeper neural networks are better than one-hidden-layer neural networks for large-scale and high-dimensional problems. The approximation theories here are constructive (i.e., with explicit formulas to specify network parameters) and quantitative (i.e., results valid for essentially arbitrary width and/or depth without lower bound constraints) with explicit error bounds working for three-hiddenlayer networks with arbitrary width. Constructive approximation with quantitative results and explicit error bounds would provide important guides for deciding the network sizes in deep learning.

### Nonlinear Approximation via Compositions

We study the approximation efficiency of function compositions in nonlinear approximation, especially the case when compositions are implemented using multi-layer feed-forward neural networks (FNNs) with ReLU activation functions. The central question of interest is what are the advantages of function compositions in generating dictionaries and what is the optimal implementation of function compositions via ReLU FNNs, especially in modern computing architecture. This question is answered by studying the $N$-term approximation rate, which is the decrease in error versus the number of computational nodes (neurons) in the approximant, together with parallel efficiency for the first time. First, for an arbitrary function $f$ on $[0,1]$, regardless of its smoothness and even the continuity, if $f$ can be approximated via nonlinear approximation using one-hidden-layer ReLU FNNs with an approximation rate $O(N^{-\eta})$, we quantitatively show that dictionaries with function compositions via deep ReLU FNNs can improve the approximation rate to $O(N^{-2\eta})$. Second, for H{\"o}lder continuous functions of order $\alpha$ with a uniform Lipchitz constant $\omega$ on a $d$-dimensional cube, we show that the $N$-term approximation via ReLU FNNs with two or three function compositions can achieve an approximation rate $O( N^{-2\alpha/d})$. The approximation rate can be improved to $O(L^{-2\alpha/d})$ by composing $L$ times, if $N$ is fixed and sufficiently large; but further compositions cannot achieve the approximation rate $O(N^{-\alpha L/d})$. Finally, considering the computational efficiency per training iteration in parallel computing, FNNs with $O(1)$ hidden layers are an optimal choice for approximating H{\"o}lder continuous functions if computing resources are enough.

### The Expressive Power of Neural Networks: A View from the Width

The expressive power of neural networks is important for understanding deep learning. Most existing works consider this problem from the view of the depth of a network. In this paper, we study how width affects the expressiveness of neural networks. Classical results state that depth-bounded (e.g. depth-2) networks with suitable activation functions are universal approximators. We show a universal approximation theorem for width-bounded ReLU networks: width-(n + 4) ReLU networks, where n is the input dimension, are universal approximators. Moreover, except for a measure zero set, all functions cannot be approximated by width-n ReLU networks, which exhibits a phase transition. Several recent works demonstrate the benefits of depth by proving the depth-efficiency of neural networks. That is, there are classes of deep networks which cannot be realized by any shallow network whose size is no more than an exponential bound. Here we pose the dual question on the width-efficiency of ReLU networks: Are there wide networks that cannot be realized by narrow networks whose size is not substantially larger? We show that there exist classes of wide networks which cannot be realized by any narrow network whose depth is no more than a polynomial bound. On the other hand, we demonstrate by extensive experiments that narrow networks whose size exceed the polynomial bound by a constant factor can approximate wide and shallow network with high accuracy. Our results provide more comprehensive evidence that depth may be more effective than width for the expressiveness of ReLU networks.