Interplay between depth of neural networks and locality of target functions
Deep neural networks (DNNs) have achieved unparalleled success in various tasks of artificial intelligence such as image classification [1, 2] and speech recognition [3]. Empirically, DNNs often outperform other machine learning methods such as kernel methods and Gaussian processes, but little is known about the mechanism underlying this outstanding performance. To elucidate the benefits of depth, numerous studies have investigated properties of DNNs from various perspectives. Approximation theory focuses on the expressive power of DNNs [4]. Although the universal approximation theorem states that a sufficiently wide neural network with a single hidden layer can approximate any continuous function, the expressivity of a DNN grows exponentially with depth rather than with width [5-8]. In statistical learning theory, the decay rate of the generalization error in the large-sample asymptotic regime has been analyzed. For learning generic smooth functions, shallow networks and other standard methods based on linear estimators, such as kernel methods, already achieve the optimal rate [9], and hence the benefits of depth are not obvious.
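For reference, one common form of the universal approximation theorem invoked above (in the spirit of Cybenko and Hornik; the exact assumptions on the activation differ between versions) can be sketched as follows: for a compact set $K \subset \mathbb{R}^d$, a continuous non-polynomial activation $\sigma$, any continuous function $f$ on $K$, and any $\varepsilon > 0$, there exist a width $N$ and parameters $c_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^d$ such that
$$\sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} c_i \, \sigma(w_i \cdot x + b_i) \Big| < \varepsilon.$$
The theorem guarantees the existence of such an approximating single-hidden-layer network, but the required width $N$ may be very large; the depth-separation results cited above [5-8] concern precisely how much more efficiently deep architectures can represent certain function classes.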