leaky-relu
Minimum Width for Deep, Narrow MLP: A Diffeomorphism Approach
Recently, there has been growing interest in determining the minimum width required for deep, narrow Multi-Layer Perceptrons (MLPs) to achieve the universal approximation property. A particularly challenging case is the approximation of continuous functions under the uniform norm, where a significant gap remains between the known lower and upper bounds. To address this problem, we propose a framework that reduces finding the minimum width of deep, narrow MLPs to determining a purely geometric function $w(d_x, d_y)$, which depends only on the input and output dimensions $d_x$ and $d_y$. Two key steps support this framework. First, we show that deep, narrow MLPs, given a small amount of additional width, can approximate a $C^2$-diffeomorphism. Using this result, we then prove that $w(d_x, d_y)$ equals the optimal minimum width required for deep, narrow MLPs to achieve universality. Combining this framework with the Whitney embedding theorem, we obtain an upper bound on the minimum width of $\max(2d_x+1, d_y) + \alpha(\sigma)$, where $0 \leq \alpha(\sigma) \leq 2$ is a constant depending on the activation function. Furthermore, we provide a lower bound of $4$ on the minimum width when the input and output dimensions are both equal to two.
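To make the width bound concrete, here is a minimal PyTorch sketch of a deep, narrow leaky-ReLU MLP whose hidden width follows the paper's upper bound $\max(2d_x+1, d_y) + \alpha(\sigma)$. This is an illustration of the bound, not the paper's construction; the depth and the `alpha_sigma` value are placeholder assumptions.

```python
import torch.nn as nn

def narrow_mlp(d_x: int, d_y: int, depth: int = 16, alpha_sigma: int = 0) -> nn.Sequential:
    """Deep, narrow MLP with hidden width max(2*d_x + 1, d_y) + alpha_sigma,
    where alpha_sigma stands in for the activation-dependent constant
    (0 <= alpha(sigma) <= 2 in the paper)."""
    width = max(2 * d_x + 1, d_y) + alpha_sigma
    layers = [nn.Linear(d_x, width), nn.LeakyReLU()]
    for _ in range(depth - 2):
        layers += [nn.Linear(width, width), nn.LeakyReLU()]
    layers.append(nn.Linear(width, d_y))
    return nn.Sequential(*layers)

# For d_x = d_y = 2 the bound gives hidden width max(2*2 + 1, 2) = 5.
model = narrow_mlp(d_x=2, d_y=2)
```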
Minimum width for universal approximation using ReLU networks on compact domain
Kim, Namjun, Min, Chanho, Park, Sejun
Understanding what neural networks can or cannot do is a fundamental problem in the study of their expressive power. Initial approaches to this problem mostly focused on depth-bounded networks. For example, one line of research studies the size of two-layer neural networks needed to memorize (i.e., perfectly fit) an arbitrary training dataset and shows that a number of parameters proportional to the dataset size is necessary and sufficient for various activation functions (Baum, 1988; Huang and Babri, 1998). Another important line of work investigates the class of functions that can be approximated by two-layer networks. Classical results in this field, represented by the universal approximation theorem, show that two-layer networks using a non-polynomial activation function are dense in the space of continuous functions on compact domains (Hornik et al., 1989; Cybenko, 1989; Leshno et al., 1993; Pinkus, 1999). Along with the success of deep learning, the expressive power of deep neural networks has been studied. As in the classical depth-bounded setting, several works have shown that deep neural networks with bounded width can memorize an arbitrary training dataset (Yun et al., 2019; Vershynin, 2020) and can approximate any continuous function (Lu et al., 2017; Hanin and Sellke, 2017). Intriguingly, it has also been shown that deeper networks can be more expressive than shallow ones. For example, Telgarsky (2016); Eldan and Shamir (2016); Daniely (2017) show that there is a class of functions that can be approximated by deep width-bounded networks with a small number of parameters but cannot be approximated by shallow networks without extremely large width.
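The depth-vs-width separation cited above can be seen in a few lines of NumPy: composing a width-2 ReLU "triangle" map with itself $k$ times yields a sawtooth with $2^k$ linear pieces using $O(k)$ parameters, whereas a shallow network needs exponentially many units to match it (the sawtooth example from Telgarsky, 2016). This is a minimal sketch of that construction, not a reproduction of any paper's code.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def triangle(x):
    # Width-2 ReLU layer: maps [0, 1] onto a hat with 2 linear pieces,
    # t(x) = 2x on [0, 1/2] and 2(1 - x) on [1/2, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

# k-fold composition: depth grows linearly, but the number of linear
# pieces doubles with each layer (2**k pieces after k layers).
x = np.linspace(0.0, 1.0, 1025)
y = x
for _ in range(4):
    y = triangle(y)
# y now has 2**4 = 16 linear pieces and crosses 0.5 at 16 evenly spaced points.
```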
APTx: better activation function than MISH, SWISH, and ReLU's variants used in deep learning
Activation functions introduce non-linearity into deep neural networks. This non-linearity helps neural networks learn faster and more efficiently from the dataset. In deep learning, many activation functions have been developed and used depending on the type of problem. ReLU's variants, SWISH, and MISH are the go-to activation functions. MISH is considered to have similar or even better performance than SWISH, and much better performance than ReLU. In this paper, we propose an activation function named APTx that behaves similarly to MISH but requires fewer mathematical operations to compute. The lower computational cost of APTx speeds up model training and thus also reduces the hardware requirements for deep learning models.
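The abstract does not state the formula, so the sketch below assumes the APTx form $(\alpha + \tanh(\beta x)) \cdot \gamma x$ with defaults $\alpha = 1$, $\beta = 1$, $\gamma = 1/2$; treat that parameterization as our reading of the paper, not a verbatim reproduction. The point of the comparison is that APTx drops the softplus inside MISH, leaving a single tanh per activation.

```python
import numpy as np

def mish(x):
    # MISH: x * tanh(softplus(x)); softplus computed as log1p(exp(x)).
    return x * np.tanh(np.log1p(np.exp(x)))

def aptx(x, alpha=1.0, beta=1.0, gamma=0.5):
    # Assumed APTx form: one tanh and no softplus, so it is cheaper
    # per activation than MISH while tracing a similar curve.
    return (alpha + np.tanh(beta * x)) * gamma * x

x = np.linspace(-6.0, 6.0, 1001)
print(np.max(np.abs(mish(x) - aptx(x))))  # gap between the two curves on this range
```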
Empirical study of the modulus as activation function in computer vision applications
Vallés-Pérez, Iván, Soria-Olivas, Emilio, Martínez-Sober, Marcelino, Serrano-López, Antonio J., Vila-Francés, Joan, Gómez-Sanchís, Juan
In this work we propose a new non-monotonic activation function: the modulus. The majority of the reported research on nonlinearities focuses on monotonic functions. We empirically demonstrate that, using the modulus activation function on computer vision tasks, models generalize better than with other nonlinearities: up to a 15% accuracy increase on CIFAR100 and 4% on CIFAR10, relative to the best of the benchmark activations tested. With the proposed activation function, the vanishing gradient and dying neuron problems disappear, because the derivative of the activation function is always 1 or -1. The simplicity of the proposed function and its derivative makes this solution especially suitable for TinyML and hardware applications.
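The modulus activation is simply $|x|$, so it is trivial to implement. Here is a minimal PyTorch sketch of using it as a drop-in nonlinearity; the convolutional block is illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Modulus(nn.Module):
    """Modulus activation |x|: non-monotonic, with derivative +1 or -1
    almost everywhere, so gradients neither vanish nor die."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.abs(x)

# Drop-in replacement for ReLU in a small CIFAR-style block (hypothetical example).
block = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), Modulus(), nn.MaxPool2d(2))
```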