The Rules-and-Facts Model for Simultaneous Generalization and Memorization in Neural Networks

Farné, Gabriele, Boncoraglio, Fabrizio, Zdeborová, Lenka

arXiv.org Machine Learning

A key capability of modern neural networks is their capacity to simultaneously learn underlying rules and memorize specific facts or exceptions. Yet, theoretical understanding of this dual capability remains limited. We introduce the Rules-and-Facts (RAF) model, a minimal solvable setting that enables precise characterization of this phenomenon by bridging two classical lines of work in the statistical physics of learning: the teacher-student framework for generalization and Gardner-style capacity analysis for memorization. In the RAF model, a fraction $1 - \varepsilon$ of training labels is generated by a structured teacher rule, while a fraction $\varepsilon$ consists of unstructured facts with random labels. We characterize when the learner can simultaneously recover the underlying rule - allowing generalization to new data - and memorize the unstructured examples. Our results quantify how overparameterization enables the simultaneous realization of these two objectives: sufficient excess capacity supports memorization, while regularization and the choice of kernel or nonlinearity control the allocation of capacity between rule learning and memorization. The RAF model provides a theoretical foundation for understanding how modern neural networks can infer structure while storing rare or non-compressible information.
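The data model described above can be illustrated with a minimal sketch (our own illustration, not the paper's exact setting): a fraction $1 - \varepsilon$ of labels follows a structured teacher rule, here taken to be a linear teacher for concreteness, while a fraction $\varepsilon$ are "facts" with random labels.

```python
import numpy as np

# Minimal sketch of a Rules-and-Facts style dataset: a fraction 1 - eps of
# labels follows a (hypothetical) linear teacher rule, while a fraction eps
# consists of unstructured facts with random labels.
rng = np.random.default_rng(0)
d, n, eps = 20, 500, 0.1

X = rng.standard_normal((n, d))
teacher = rng.standard_normal(d)           # illustrative teacher weights
y = np.sign(X @ teacher)                   # rule-generated labels

n_facts = int(eps * n)
fact_idx = rng.choice(n, n_facts, replace=False)
y[fact_idx] = rng.choice([-1.0, 1.0], n_facts)  # random "fact" labels

print(n_facts, "memorized facts out of", n, "examples")
```

A learner that fits this dataset perfectly must both recover the teacher direction (to generalize on the rule-labeled part) and store the random fact labels, which is exactly the tension the model is designed to quantify.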




A Proof of Theorem 1

Neural Information Processing Systems

Theorem 6 is stated in terms of Gaussian complexity; a full proof is given by Ben-David (2014). $\mathcal{M}(\alpha)$ is the linear class following the depth-$K$ neural network. The second term relies on the Lipschitz constant of the DNN, which we bound with the following lemma. Similar results are given by Scaman and Virmaux (2018) and Fazlyab et al. (2019).
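One standard way to bound the Lipschitz constant of a ReLU network (a common approach in the cited works, not necessarily the exact lemma referred to above) is to take the product of the spectral norms of the weight matrices, since each ReLU is 1-Lipschitz. A minimal sketch:

```python
import numpy as np

# Upper-bound the Lipschitz constant of a ReLU network by the product of
# the spectral norms (largest singular values) of its weight matrices.
# The network and its sizes here are illustrative.
rng = np.random.default_rng(4)
Ws = [rng.standard_normal((8, 8)) / np.sqrt(8) for _ in range(3)]

lip_bound = np.prod([np.linalg.norm(W, 2) for W in Ws])
print("Lipschitz upper bound:", lip_bound)
```

This bound can be loose in practice, which is why tighter estimates such as those of Scaman and Virmaux (2018) and Fazlyab et al. (2019) are of interest.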



Understanding Square Loss in Training Overparametrized Neural Network Classifiers

Neural Information Processing Systems

Deep learning has achieved many breakthroughs in modern classification tasks. Numerous architectures have been proposed for different data structures but when it comes to the loss function, the cross-entropy loss is the predominant choice. Recently, several alternative losses have seen revived interests for deep classifiers. In particular, empirical evidence seems to promote square loss but a theoretical justification is still lacking. In this work, we contribute to the theoretical understanding of square loss in classification by systematically investigating how it performs for overparametrized neural networks in the neural tangent kernel (NTK) regime.
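The basic recipe studied here, classification by regression on the labels, can be sketched with a linear model standing in for an overparametrized network in the NTK regime (our own simplified illustration, not the paper's analysis): fit labels in $\{-1, +1\}$ by minimizing square loss, then classify by the sign of the output.

```python
import numpy as np

# Square-loss classification sketch: regress +/-1 labels with a ridge-
# regularized least-squares fit, then predict by sign. The linear model
# is a stand-in for an NTK-regime network.
rng = np.random.default_rng(1)
d, n = 10, 200
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)            # illustrative ground truth
y = np.sign(X @ w_true)                    # labels in {-1, +1}

lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # square-loss fit

train_acc = np.mean(np.sign(X @ w) == y)
print("training accuracy:", train_acc)
```

The point of the paper is to explain theoretically why this square-loss recipe performs comparably to cross-entropy for overparametrized networks, not merely that it can fit the data.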


Near-Optimal SQ Lower Bounds for Agnostically Learning Halfspaces and ReLUs under Gaussian Marginals

Neural Information Processing Systems

We study the fundamental problems of agnostically learning halfspaces and ReLUs under Gaussian marginals. In the former problem, given labeled examples $(\mathbf{x}, y)$ from an unknown distribution on $\mathbb{R}^d \times \{\pm 1\}$, whose marginal distribution on $\mathbf{x}$ is the standard Gaussian and the labels $y$ can be arbitrary, the goal is to output a hypothesis with 0-1 loss $\mathrm{OPT} + \epsilon$, where $\mathrm{OPT}$ is the 0-1 loss of the best-fitting halfspace. In the latter problem, given labeled examples $(\mathbf{x}, y)$ from an unknown distribution on $\mathbb{R}^d \times \mathbb{R}$, whose marginal distribution on $\mathbf{x}$ is the standard Gaussian and the labels $y$ can be arbitrary, the goal is to output a hypothesis with square loss $\mathrm{OPT} + \epsilon$, where $\mathrm{OPT}$ is the square loss of the best-fitting ReLU. We prove Statistical Query (SQ) lower bounds of $d^{\mathrm{poly}(1/\epsilon)}$ for both of these problems. Our SQ lower bounds provide strong evidence that current upper bounds for these tasks are essentially best possible.
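The two loss notions in this abstract are easy to make concrete (an illustrative sketch with names of our own choosing): on Gaussian-marginal data with arbitrary labels, the 0-1 loss of a fixed halfspace and the square loss of a fixed ReLU are simple empirical averages.

```python
import numpy as np

# Empirical 0-1 loss of a halfspace and square loss of a ReLU on data
# with standard Gaussian marginals and arbitrary (here random) labels.
rng = np.random.default_rng(2)
d, n = 5, 1000
X = rng.standard_normal((n, d))            # standard Gaussian marginal
y_sign = rng.choice([-1, 1], n)            # arbitrary +/-1 labels
y_real = rng.standard_normal(n)            # arbitrary real labels
w = rng.standard_normal(d)                 # a candidate weight vector

zero_one = np.mean(np.sign(X @ w) != y_sign)        # halfspace 0-1 loss
sq = np.mean((np.maximum(X @ w, 0.0) - y_real)**2)  # ReLU square loss
print(zero_one, sq)
```

The hardness results concern minimizing these quantities over all $w$ up to additive $\epsilon$ of the optimum $\mathrm{OPT}$; evaluating the losses for a fixed $w$, as above, is of course easy.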



Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

"NIPS Neural Information Processing Systems 8-11th December 2014, Montreal, Canada",,, "Paper ID:","1461" "Title:","The limits of squared Euclidean distance regularization" Current Reviews First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. This paper considers the problem of empirical risk minimization with squared distance regularization, which results in a weight vector that is a linear combination of the training examples. The authors prove a linear lower bound on the average square loss of the algorithm on random problems, provided the loss function is nice enough, while the same problem is easy to learn by another algorithm. This is a well-written paper on a simple idea and result, with a rather interesting interpretation. The proposed conjectures on random features and neural networks should be fleshed out in more detail, or at least with more empirical evidence.