Re-examining Double Descent and Scaling Laws under Norm-based Capacity via Deterministic Equivalence

Yichen Wang, Yudong Chen, Lorenzo Rosasco, Fanghui Liu

arXiv.org Machine Learning 

The number of parameters, i.e., the model size, provides a basic measure of the capacity of a machine learning (ML) model. However, it is well known that the parameter count may not capture the effective model capacity (Bartlett, 1998), especially for over-parameterized neural networks (Belkin et al., 2018; Zhang et al., 2021) and large language models (Brown et al., 2020). Focusing on the number of parameters therefore yields an inaccurate characterization of the relationship between the test risk R, the training data size n, and the model size p, a relationship that is central in ML for understanding the bias-variance trade-off (Vapnik, 1995), double descent (Belkin et al., 2019), and scaling laws (Kaplan et al., 2020; Xiao, 2024). For example, even for the same architecture (and hence the same model size), the test error can behave very differently (Nakkiran et al., 2020, 2021): double descent may disappear. Here we shift the focus from model size to the weights themselves and consider their norm, a perspective pioneered by the classical results of Bartlett (1998). Indeed, norm-based capacity/complexity measures are widely regarded as more effective in characterizing generalization behavior; see, e.g.,
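As a purely illustrative sketch (not taken from the paper), the following minimal NumPy experiment fits a minimum-norm least-squares solution on ReLU random features and reports both the test risk and the Euclidean norm of the learned coefficients as the model size p varies; the teacher model, feature map, and all constants are assumptions chosen only to make the double-descent behavior around p ≈ n visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic teacher: y = <x, w*> + noise, with d-dimensional inputs (assumed setup).
d, n, n_test = 30, 100, 2000
w_star = rng.standard_normal(d) / np.sqrt(d)
X_train = rng.standard_normal((n, d))
X_test = rng.standard_normal((n_test, d))
y_train = X_train @ w_star + 0.1 * rng.standard_normal(n)
y_test = X_test @ w_star

def random_features(X, W):
    # ReLU random-features map; p = number of features plays the role of model size.
    return np.maximum(X @ W, 0.0)

for p in [10, 50, 90, 100, 110, 200, 1000]:
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    Phi_train = random_features(X_train, W)
    Phi_test = random_features(X_test, W)
    # Minimum-norm least-squares fit via the pseudo-inverse; interpolates once p >= n.
    theta = np.linalg.pinv(Phi_train) @ y_train
    test_risk = np.mean((Phi_test @ theta - y_test) ** 2)
    print(f"p = {p:5d}  ||theta||_2 = {np.linalg.norm(theta):8.3f}  test risk = {test_risk:.4f}")
```

In such a sketch, the norm of the fitted coefficients typically peaks near the interpolation threshold p ≈ n together with the test risk, which is the kind of behavior that motivates tracking norm-based capacity rather than the raw parameter count.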