Classifying high-dimensional Gaussian mixtures: Where kernel methods fail and neural networks succeed
Refinetti, Maria, Goldt, Sebastian, Krzakala, Florent, Zdeborová, Lenka
Explaining the success of deep neural networks in many areas of machine learning remains a key challenge for learning theory. A series of recent theoretical works made progress towards this goal by proving the trainability of two-layer neural networks (2LNN) with gradient-based methods [1-6]. These results are based on the observation that strongly over-parameterised 2LNN can achieve good performance even if their first-layer weights remain almost constant throughout training. This is the case if the initial weights are chosen with a particular scaling, dubbed the "lazy regime" by Chizat et al. [7]. This behaviour is to be contrasted with the "feature learning regime", where the first-layer weights move significantly during training. Going a step further, simply fixing the first-layer weights of a 2LNN at their initial values yields the well-known random features model of Rahimi & Recht [8, 9], which can be seen as an approximation of kernel learning [10]. Recent empirical studies showed that, on some benchmark data sets in computer vision, kernels derived from neural networks achieve performance comparable to that of neural networks [11-16]. These results raise the question of whether neural networks only learn successfully when random features can also learn successfully, and have led to renewed interest in the exact conditions under which neural networks trained with gradient descent achieve better performance than random features [17-20].
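The distinction between the two regimes can be illustrated concretely: freezing the first-layer weights of a 2LNN at their random initialisation and training only the second layer gives a random-features model, while training both layers allows feature learning. The following is a minimal sketch of that comparison on a toy two-cluster Gaussian mixture; the task, network width, learning rate, and helper names are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch (illustrative, not the paper's setup): a two-layer network
# (2LNN) on a toy Gaussian-mixture classification task. Freezing the first
# layer at its random initialisation yields a random-features model; letting
# it train corresponds to the feature-learning regime.
import torch
import torch.nn as nn

def make_gaussian_mixture(n, d, sep=2.0):
    """Two isotropic Gaussian clusters in d dimensions with labels in {-1, +1}."""
    y = torch.randint(0, 2, (n,)) * 2 - 1          # random labels -1 / +1
    mu = torch.zeros(d); mu[0] = sep               # cluster means at +/- sep along the first axis
    x = y[:, None] * mu + torch.randn(n, d)
    return x, y.float()

def train_2lnn(freeze_first_layer, d=50, hidden=200, n=2000, steps=500, lr=0.05):
    x, y = make_gaussian_mixture(n, d)
    model = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    if freeze_first_layer:                         # random-features regime
        for p in model[0].parameters():
            p.requires_grad_(False)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x).squeeze(-1), y)
        loss.backward()
        opt.step()
    return loss.item()

print("random features, final loss :", train_2lnn(freeze_first_layer=True))
print("feature learning, final loss:", train_2lnn(freeze_first_layer=False))
```

This toy problem is linearly separable, so both regimes will do reasonably well here; the paper's point is that on suitably structured high-dimensional Gaussian mixtures the gap between the two regimes becomes essential.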
Feb-23-2021