sgn
Global Minimizers of ℓp-Regularized Objectives Yield the Sparsest ReLU Neural Networks
Overparameterized neural networks can interpolate a given dataset in many different ways, prompting the fundamental question: which among these solutions should we prefer, and what explicit regularization strategies will provably yield these solutions? This paper addresses the challenge of finding the sparsest interpolating ReLU network--i.e., the network with the fewest nonzero parameters or neurons--a goal with wide-ranging implications for efficiency, generalization, interpretability, theory, and model compression. Unlike post hoc pruning approaches, we propose a continuous, almost-everywhere differentiable training objective whose global minima are guaranteed to correspond to the sparsest singlehidden-layer ReLU networks that fit the data. This result marks a conceptual advance: it recasts the combinatorial problem of sparse interpolation as a smooth optimization task, potentially enabling the use of gradient-based training methods. Our objective is based on minimizing ℓp quasinorms of the weights for 0 < p < 1, a classical sparsity-promoting strategy in finite-dimensional settings. However, applying these ideas to neural networks presents new challenges: the function class is infinite-dimensional, and the weights are learned using a highly nonconvex objective. We prove that, under our formulation, global minimizers correspond exactly to sparsest solutions. Our work lays a foundation for understanding when and how continuous sparsity-inducing objectives can be leveraged to recover sparse networks through training.
Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks
Adam is a popular and widely used adaptive gradient method in deep learning, which has also received tremendous focus in theoretical research. However, most existing theoretical work primarily analyzes its full-batch version, which differs fundamentally from the stochastic variant used in practice. Unlike SGD, stochastic Adam does not converge to its full-batch counterpart even with infinitesimal learning rates. We present the first theoretical characterization of how batch size affects Adam's generalization, analyzing two-layer over-parameterized CNNs on image data. Our results reveal that while both Adam and AdamW with proper weight decay λ converge to poor test error solutions, their mini-batch variants can achieve near-zero test error. We further prove Adam has a strictly smaller effective weight decay bound than AdamW, theoretically explaining why Adam requires more sensitive λtuning.
ec4f0b0a7557d6a51c42308800f2c23a-Supplemental-Conference.pdf
Let (x,y)be a binary classification task that admits a smooth separator as in Assumption 1. Then, there exists an RLC with neural network fθ and absolutely continuous randomness source u (Assumption 2) that is universal in the limit, i.e., Fθ (x) = y(x), x X, and makes random predictions that are correct with probability P(maj({sgn( a Further, if p is the number of parameters used by a deterministic neural network with one hidden layer to achieve zero-error in the task, fθ has at most p p +O(1)parameters. Since Assumption 1 holds3, there exists a single hidden-layer neural network N that, like s, achieves zero-error in this task [8]. Further, since sgn is nonpolynomial, we can use it as the nonlinearity of this network [21]. Putting it all together, there exists a number of hidden units M and parameters bj,oj R,wj Rd for j = 1,...,M such that N(x):= Note that this means we can achieve zero-error in classification, N(x) = y(x), x X.
Faster Directional Convergence of Linear Neural Networks under Spherically Symmetric Data
In this paper, we study gradient methods for training deep linear neural networks with binary cross-entropy loss. In particular, we show global directional convergence guarantees from a polynomial rate to a linear rate for (deep) linear networks with spherically symmetric data distribution, which can be viewed as a specific zero-margin dataset. Our results do not require the assumptions in other works such as small initial loss, presumed convergence of weight direction, or overparameterization. We also characterize our findings in experiments.