Re-examining Double Descent and Scaling Laws under Norm-based Capacity via Deterministic Equivalence

Yichen Wang, Yudong Chen, Lorenzo Rosasco, Fanghui Liu

arXiv.org Machine Learning 

The number of parameters, i.e., the model size, provides a basic measure of the capacity of a machine learning (ML) model. However, it is well known that the parameter count may not capture the effective model capacity (Bartlett, 1998), especially for over-parameterized neural networks (Belkin et al., 2018; Zhang et al., 2021) and large language models (Brown et al., 2020). Focusing on the number of parameters therefore yields an inaccurate characterization of the relationship between the test risk R, the training data size n, and the model size p, a relationship that is central in ML for understanding the bias-variance trade-off (Vapnik, 1995), double descent (Belkin et al., 2019), and scaling laws (Kaplan et al., 2020; Xiao, 2024). For example, even for the same architecture (and hence the same model size), the test error can behave very differently (Nakkiran et al., 2020, 2021): double descent may disappear. Here we shift the focus from model size to the weights themselves and consider their norm, a perspective pioneered by the classical results of Bartlett (1998). Indeed, norm-based capacity/complexity measures are widely regarded as more effective in characterizing generalization behavior; see, e.g.,
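As a purely illustrative sketch (not taken from the paper), the following minimal NumPy experiment fits a minimum-norm least-squares solution on ReLU random features and reports both the test risk and the Euclidean norm of the learned coefficients as the model size p varies; the teacher model, feature map, and all constants are assumptions chosen only to make the double-descent behavior around p ≈ n visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic teacher: y = <x, w*> + noise, with d-dimensional inputs (assumed setup).
d, n, n_test = 30, 100, 2000
w_star = rng.standard_normal(d) / np.sqrt(d)
X_train = rng.standard_normal((n, d))
X_test = rng.standard_normal((n_test, d))
y_train = X_train @ w_star + 0.1 * rng.standard_normal(n)
y_test = X_test @ w_star

def random_features(X, W):
    # ReLU random-features map; p = number of features plays the role of model size.
    return np.maximum(X @ W, 0.0)

for p in [10, 50, 90, 100, 110, 200, 1000]:
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    Phi_train = random_features(X_train, W)
    Phi_test = random_features(X_test, W)
    # Minimum-norm least-squares fit via the pseudo-inverse; interpolates once p >= n.
    theta = np.linalg.pinv(Phi_train) @ y_train
    test_risk = np.mean((Phi_test @ theta - y_test) ** 2)
    print(f"p = {p:5d}  ||theta||_2 = {np.linalg.norm(theta):8.3f}  test risk = {test_risk:.4f}")
```

In such a sketch, the norm of the fitted coefficients typically peaks near the interpolation threshold p ≈ n together with the test risk, which is the kind of behavior that motivates tracking norm-based capacity rather than the raw parameter count.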