When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling

Wen, Garrett G., Hu, Hong, Lu, Yue M., Fan, Zhou, Misiakiewicz, Theodor

arXiv.org Machine Learning

A major effort in modern high-dimensional statistics has been devoted to the analysis of linear predictors trained on nonlinear feature embeddings via empirical risk minimization (ERM). Gaussian equivalence theory (GET) has emerged as a powerful universality principle in this context: it states that the behavior of high-dimensional, complex features can be captured by Gaussian surrogates, which are more amenable to analysis. Despite its remarkable successes, numerical experiments show that this equivalence can fail even for simple embeddings -- such as polynomial maps -- under general scaling regimes. We investigate this breakdown in the setting of random feature (RF) models in the quadratic scaling regime, where both the number of features and the sample size grow quadratically with the data dimension. We show that when the target function depends on a low-dimensional projection of the data, such as generalized linear models, GET yields incorrect predictions. To capture the correct asymptotics, we introduce a Conditional Gaussian Equivalent (CGE) model, which can be viewed as appending a low-dimensional non-Gaussian component to an otherwise high-dimensional Gaussian model. This hybrid model retains the tractability of the Gaussian framework and accurately describes RF models in the quadratic scaling regime. We derive sharp asymptotics for the training and test errors in this setting, which continue to agree with numerical simulations even when GET fails. Our analysis combines general results on CLT for Wiener chaos expansions and a careful two-phase Lindeberg swapping argument. Beyond RF models and quadratic scaling, our work hints at a rich landscape of universality phenomena in high-dimensional ERM.
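The random feature setup described above can be made concrete with a minimal, self-contained sketch (not the paper's code; all names and sizes here are illustrative assumptions): data of dimension d is passed through a fixed nonlinear embedding with p random weights, and a linear predictor is fit on the embedded features by ridge-regularized ERM. In the quadratic scaling regime studied in the paper, both n and p would grow proportionally to d**2.

```python
import numpy as np

# Minimal sketch of a random feature (RF) ridge regression (illustrative only).
# In the quadratic scaling regime, n and p grow like d**2; here 400 = 20**2.
rng = np.random.default_rng(0)
d, n, p = 20, 400, 400          # data dimension; sample size; feature count
lam = 1e-3                      # ridge penalty (illustrative choice)

X = rng.standard_normal((n, d)) / np.sqrt(d)   # high-dimensional inputs
W = rng.standard_normal((p, d))                # fixed random feature weights
beta = rng.standard_normal(d)
y = np.tanh(X @ beta)           # target depends on a 1-d projection of the data
                                # (a generalized linear model, as in the paper)

Z = np.maximum(X @ W.T, 0.0)    # nonlinear embedding (ReLU features)
# Ridge ERM on the embedded features: minimize ||Z a - y||^2 + lam ||a||^2
a_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
train_err = np.mean((Z @ a_hat - y) ** 2)
```

Gaussian equivalence theory would replace the features Z by a Gaussian surrogate with matching first and second moments; the paper's point is that in this quadratic regime, with a low-dimensional target, that surrogate mispredicts the training and test errors, and a conditional (hybrid) Gaussian model is needed instead.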




comments. Reviewer #1 wants to see an algorithm that works when b

Neural Information Processing Systems

We thank all the reviewers for their time and valuable comments. "Provide an algorithm to output a distribution that's close to the target, even if b has negative components." We will mention this in the paper. This is an interesting direction for future research. "What happens when we increase the number of layers?"





Appendix A Outline. This appendix is organized as follows: In Section B we provide the preliminaries and notation used

Neural Information Processing Systems

From eq. (8) we obtain a lower bound on the entries. We use the dynamics equation and apply Grönwall's inequality. For D = 2, eq. (20) yields a logarithmic lower bound on γ(t). We show that Condition 8, which is equivalent to Condition 5, holds for the linearized model. From eq. (10) we obtain a bound on γ. Indeed, from eqs. (13) and (11), ŵ is obtained as a limit. We change variables t ↦ γ(t) and use eq. (20). Next, it is easy to verify, for all i = 1, ..., d, the claimed property of ŵ. The proof is similar in spirit to that of the case D = 2 (see Appendix F.1).
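For reference, the standard integral form of Grönwall's inequality invoked in the excerpt above is the following well-known fact (stated here in its textbook form, not the paper's specific variant):

```latex
% Grönwall's inequality (integral form). If $u$ is continuous, $a \ge 0$, and
%   u(t) \le c + \int_0^t a(s)\, u(s)\, ds  \quad \text{for all } t \ge 0,
% then
%   u(t) \le c \exp\!\left( \int_0^t a(s)\, ds \right).
\[
u(t) \le c + \int_0^t a(s)\, u(s)\, ds
\quad \Longrightarrow \quad
u(t) \le c \exp\!\left( \int_0^t a(s)\, ds \right).
\]
```

It is typically used exactly as in the excerpt: a dynamics equation gives an integral inequality for a quantity such as γ(t), and Grönwall's inequality converts it into an explicit growth bound.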