Generalization for Least Squares Regression With Simple Spiked Covariances

Jiping Li, Rishi Sonthalia

arXiv.org Machine Learning 

Random matrix theory has proven to be a valuable tool for analyzing the generalization of linear models. However, the generalization properties of even two-layer neural networks trained by gradient descent remain poorly understood. To understand the generalization performance of such networks, it is crucial to characterize the spectrum of the feature matrix at the hidden layer. Recent work has made progress in this direction by describing the spectrum after a single gradient step, revealing a spiked covariance structure. Yet the generalization error of linear models with spiked covariances had not previously been determined. We derive this generalization error in the asymptotic proportional regime. Our analysis demonstrates that the eigenvector and eigenvalue corresponding to the spike significantly influence the generalization error.

Significant theoretical work has been dedicated to understanding generalization in linear regression models (Dobriban & Wager, 2018; Advani et al., 2020; Mel & Ganguli, 2021; Derezinski et al., 2020; Hastie et al., 2022; Kausik et al., 2024; Wang et al., 2024a). In the random features approximation, the first layer of the neural network is held fixed and only the outer layer is trained. It has been shown that understanding the generalization of such models requires analyzing the distribution of singular values of the hidden-layer feature matrix F. Works such as Pennington & Worah (2017); Adlam et al. (2019); Benigni & Péché (2021); Fan & Wang (2020); Wang & Zhu (2024); Péché (2019); Piccolo & Schröder (2021) have studied the spectrum of F in the asymptotic limit, enabling an understanding of the generalization.

However, random feature models do not leverage the feature learning capabilities of neural networks. To gain further insight into the performance of two-layer neural networks and their feature learning capabilities, we need to train the inner layer. Recent studies such as Ba et al. (2022); Moniri et al. (2023) have examined the effect on F of taking one gradient step on the inner layer. Specifically, Ba et al. (2022) showed that with a sufficiently large step size η, two-layer models can already outperform random feature models after just one step. Moniri et al. (2023) extended this work to study many different scales for the step size. In both analyses, the trained feature matrix decomposes into the initial feature matrix F_0 plus a low-rank update P induced by the gradient step, yielding a spiked spectrum: the bulk corresponds to F_0, while the spikes represent the effect of P.
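To make the bulk-versus-spike picture concrete, the following is a minimal numerical sketch (not the authors' code): a random initial feature matrix F_0 receives a rank-one update P, and the resulting singular value spectrum shows a Marchenko-Pastur-style bulk coming from F_0 plus a single outlier coming from P. The dimensions, the Gaussian model for F_0, and the update scale eta are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 1000                      # proportional regime: n, d large with n/d fixed

# Bulk: i.i.d. Gaussian "initial" features, entries of variance 1/d.
F0 = rng.standard_normal((n, d)) / np.sqrt(d)

# Rank-one update P = eta * u v^T (stand-in for the gradient-step effect).
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(d); v /= np.linalg.norm(v)
eta = 5.0                              # assumed update strength, large enough to spike
F = F0 + eta * np.outer(u, v)

s = np.linalg.svd(F, compute_uv=False)
bulk_edge = 1 + np.sqrt(n / d)         # right edge of F0's singular value bulk
print(f"top singular value (spike):  {s[0]:.3f}")
print(f"second singular value:       {s[1]:.3f}")
print(f"bulk edge of F0:             {bulk_edge:.3f}")
```

Running this, the top singular value separates clearly from the bulk edge, while the remaining singular values stay inside the bulk, mirroring the spiked structure described above.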
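The paper's central object is the generalization error of least squares under such a spiked covariance. The small simulation below suggests why the spike's eigenvalue and eigenvector matter; it is an assumption-laden illustration, not the paper's derivation. Data are drawn with covariance Sigma = I + theta * v v^T in an overparameterized proportional regime, and the test error of the minimum-norm interpolator is compared for a target signal aligned with the spike eigenvector v versus one orthogonal to it; theta, sigma, and the dimensions are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, theta, sigma = 300, 600, 10.0, 0.5   # overparameterized: d > n, d/n fixed

v = np.zeros(d); v[0] = 1.0                 # spike eigenvector

def test_error(beta_star, trials=20):
    errs = []
    for _ in range(trials):
        # Rows of X have covariance (I + a v v^T)^2 = I + theta v v^T
        # with a = sqrt(1 + theta) - 1.
        Z = rng.standard_normal((n, d))
        X = Z + (np.sqrt(1 + theta) - 1) * np.outer(Z @ v, v)
        y = X @ beta_star + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # min-norm interpolator
        diff = beta_hat - beta_star
        # Population risk E[(x^T diff)^2] = diff^T Sigma diff
        #                                 = ||diff||^2 + theta * (v^T diff)^2.
        errs.append(diff @ diff + theta * (v @ diff) ** 2)
    return float(np.mean(errs))

beta_aligned = v.copy()                      # signal along the spike
beta_orth = np.zeros(d); beta_orth[1] = 1.0  # signal orthogonal to the spike
print("signal aligned with spike:  ", test_error(beta_aligned))
print("signal orthogonal to spike: ", test_error(beta_orth))
```

In this toy setting the two errors differ markedly: the high-variance spike direction is estimated far more accurately than a bulk direction, consistent with the claim that the spike's eigenvalue and eigenvector significantly influence the generalization error.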