Goto

Collaborating Authors

 learning curve



Learning Curves and Benign Overfitting of Spectral Algorithms in Large Dimensions

arXiv.org Machine Learning

Existing large-dimensional theory for spectral algorithms resolves either the optimally tuned point or the interpolation limit, but leaves the under-regularized regime unexplored. We study the learning curve and benign overfitting of spectral algorithms in the largedimensional setting where the sample size and dimension are of comparable order, i.e., n dฮณ for some ฮณ > 0. We first consider inner-product kernels on the sphere Sd 1 and establish a sharp asymptotic characterization of the excess risk across the full regularization path under various source conditions s 0, where smeasures the relative smoothness of the regression function. Our results reveal that the learning curve is not simply U-shaped but instead consists of three distinct regimes: over-regularized, under-regularized, and interpolation regimes. This characterization allows us to fully capture the benign overfitting phenomenon, demonstrating that benign overfitting arises consistently across both the under-regularized and interpolation regimes whenever sis positive but no larger than a critical threshold. We further show that, in the sufficiently regularized regime, the kernel learning curve is recovered by an associated sequence model. Finally, we extend the learning-curve analysis to large-dimensional KRR for a class of kernels on general domains in Rd whose low-degree eigenspaces satisfy spectral-scaling and hyper-contractivity conditions. Keywords: Spectral algorithms, learning curves, high dimension, benign overfitting. 1 Introduction Nonparametric regression studies the estimation of an unknown function f: Rd R from ni.i.d.



example where multi step outperforms one step

Neural Information Processing Systems

As explained in the main text, this section presents an example that is only a slight modification of the one in Figure 4, but where a multi-step approach is clearly preferred over just one step. The data-generating and learning processes are exactly the same (100 trajectories of length 100, discount 0.9, ฮฑ = 0.1for reverse KL regularization). The only difference is that rather than using a behavior that is a mixture of optimal and uniform, we use a behavior that is a mixture of maximally suboptimal and uniform. If we call the suboptimal policy ฯ€ (which always goes down and left in our gridworld), then the behavior for the modified example is ฮฒ = 0.2 ฯ€ +0.8 u, where uis uniform. Results are shown in Figure 7. Figure 7: A gridworld example with modified behavior where multi-step is much better than one-step.



Precise Learning Curves and Higher-Order Scaling Limits for Dot Product Kernel Regression

Neural Information Processing Systems

As modern machine learning models continue to advance the computational frontier, it has become increasingly important to develop precise estimates for expected performance improvements under different model and data scaling regimes. Currently, theoretical understanding of the learning curves that characterize how the prediction error depends on the number of samples is restricted to either largesample asymptotics (m!1) or, for certain simple data distributions, to the high-dimensional asymptotics in which the number of samples scales linearly with the dimension (m / d). There is a wide gulf between these two regimes, including all higher-order scaling relations m / dr, which are the subject of the present paper. We focus on the problem of kernel ridge regression for dot-product kernels and present precise formulas for the mean of the test error, bias, and variance, for data drawn uniformly from the sphere with isotropic random labels in the rth-order asymptotic scaling regime m!1 with m/dr held constant. We observe a peak in the learning curve whenever m dr/r! for any integer r, leading to multiple sample-wise descent and nontrivial behavior at multiple scales. We include a colab2 notebook that reproduces the essential results of the paper.


AMore Discussion

Neural Information Processing Systems

Why One-step and IQL are imitation-based methods? The core difference between RL-based and imitation-based methods is that RL-based methods learn a value function of policy ฯ€ while imitation-based methods don't. Learning the value function of ฯ€ requires off-policy evaluation of ฯ€ (i.e., learning Qฯ€ or Vฯ€), which is prone to distribution shift. The policy evaluation and policy improvement will also affect each other as they are coupled. Imitation-based methods don't learn Qฯ€ or Vฯ€, but some of them do learn a value function.