Gradient Descent
Review for NeurIPS paper: On the universality of deep learning
The paper shows that any functional class that can be learned in polynomial time by some algorithm can be learned in polynomial time by deep neural networks using stochastic gradient descent. This sheds light, in part, on the empirical success of deep learning, and makes an important contribution toward furthering our understanding of efficient learning of neural networks. Authors complement the result with several extensions, including (a) showing that the results hold even when polynomial noise is added to the gradients or when weights can be of polynomial precision, (b) showing that a network of size O(n 2) and depth O(log(n)) can learn parities using SGD in poly time, and (c) lower bounds for descent algorithms characterized in terms of novel properties of networks that may be of independent interest. The paper reads very well, and the results and insights in the paper are very compelling.
Review for NeurIPS paper: Estimating Training Data Influence by Tracing Gradient Descent
Weaknesses: I have some major concerns with the evaluation part of the paper. A simple baseline could be a loss based selection method. Simply select training points based on loss change. A recent paper [DataLens IJCNN 20] shows that a simple loss based selection outperforms both influence functions and representer selection on mislabelled data identification when the mislabeled data is small. As the fraction of mislabelled data increases, influence function works better than loss based method.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
The authors identify a strategy for compensating agents to disclose private and relevant data. The strategy draws on interesting ideas from the machine learning literature, including the use of the stochastic gradient descent algorithm to set the payment for the data. This allows effective compensation of agents while maintaining a limited budget. The authors also include mechanisms for preserving the privacy of agents, and identification of different "profit maximizing" strategies for agent to select given their confidence in their data. It appears that the substantive contribution is in identifying a mechanism that compensates agents for their data while maintaing a bounded budget.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
This paper proposes a computationally efficient way of scaling kernel non-linear component analysis. Each of these methods works effectively in certain settings but the number of samples/features may be prohibitively large in some settings. This paper proposes a doubley-stochastic gradient algorithm for solving kernel eigenvalue problems encompassing kernel PCA, CCA, and SVD. It is called doubly-stochastic because sampling is performed in the data samples and the random features (for this paper, limited to stationary kernels whose Fourier transform is well-defined). The paper claims convergence rate of \tilde{O}(1/t) to the global optimum for the recovered eigensubspace (in terms of principal angle), but with a caveat being that the step sizes are chosen properly and the mini-batch size is sufficiently large.
Generative-enhanced optimization for knapsack problems: an industry-relevant study
Vodovozova, Yelyzaveta, Awasthi, Abhishek, Jones, Caitlin, Doetsch, Joseph, Wintersperger, Karen, Krellner, Florian, Riofrío, Carlos A.
Optimization is a crucial task in various industries such as logistics, aviation, manufacturing, chemical, pharmaceutical, and insurance, where finding the best solution to a problem can result in significant cost savings and increased efficiency. Tensor networks (TNs) have gained prominence in recent years in modeling classical systems with quantum-inspired approaches. More recently, TN generative-enhanced optimization (TN-GEO) has been proposed as a strategy which uses generative modeling to efficiently sample valid solutions with respect to certain constraints of optimization problems. Moreover, it has been shown that symmetric TNs (STNs) can encode certain constraints of optimization problems, thus aiding in their solution process. In this work, we investigate the applicability of TN- and STN-GEO to an industry relevant problem class, a multi-knapsack problem, in which each object must be assigned to an available knapsack. We detail a prescription for practitioners to use the TN-and STN-GEO methodology and study its scaling behavior and dependence on its hyper-parameters. We benchmark 60 different problem instances and find that TN-GEO and STN-GEO produce results of similar quality to simulated annealing.
Implicit Bias of SignGD and Adam on Multiclass Separable Data
Fan, Chen, Schmidt, Mark, Thrampoulidis, Christos
In the optimization of overparameterized models, different gradient-based methods can achieve zero training error yet converge to distinctly different solutions inducing different generalization properties. While a decade of research on implicit optimization bias has illuminated this phenomenon in various settings, even the foundational case of linear classification with separable data still has important open questions. We resolve a fundamental gap by characterizing the implicit bias of both Adam and Sign Gradient Descent in multi-class cross-entropy minimization: we prove that their iterates converge to solutions that maximize the margin with respect to the classifier matrix's max-norm and characterize the rate of convergence. We extend our results to general p-norm normalized steepest descent algorithms and to other multi-class losses.
Any-stepsize Gradient Descent for Separable Data under Fenchel--Young Losses
Bao, Han, Sakaue, Shinsaku, Takezawa, Yuki
The gradient descent (GD) has been one of the most common optimizer in machine learning. In particular, the loss landscape of a neural network is typically sharpened during the initial phase of training, making the training dynamics hover on the edge of stability. This is beyond our standard understanding of GD convergence in the stable regime where arbitrarily chosen stepsize is sufficiently smaller than the edge of stability. Recently, Wu et al. (COLT2024) have showed that GD converges with arbitrary stepsize under linearly separable logistic regression. Although their analysis hinges on the self-bounding property of the logistic loss, which seems to be a cornerstone to establish a modified descent lemma, our pilot study shows that other loss functions without the self-bounding property can make GD converge with arbitrary stepsize. To further understand what property of a loss function matters in GD, we aim to show arbitrary-stepsize GD convergence for a general loss function based on the framework of \emph{Fenchel--Young losses}. We essentially leverage the classical perceptron argument to derive the convergence rate for achieving $\epsilon$-optimal loss, which is possible for a majority of Fenchel--Young losses. Among typical loss functions, the Tsallis entropy achieves the GD convergence rate $T=\Omega(\epsilon^{-1/2})$, and the R{\'e}nyi entropy achieves the far better rate $T=\Omega(\epsilon^{-1/3})$. We argue that these better rate is possible because of \emph{separation margin} of loss functions, instead of the self-bounding property.
Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
Atanasov, Alexander, Bordelon, Blake, Zavatone-Veth, Jacob A., Paquette, Courtney, Pehlevan, Cengiz
Modern deep learning practice is governed by the surprising predictability of performance improvement with increases in the scale of data, model size, and compute [17]. Often, the scaling of performance as a function of these quantities exhibits remarkably regular power law behavior, termed a neural scaling law [2, 6, 12, 13, 15, 16, 18, 19, 22, 32]. Here, performance is usually measured by some differentiable loss on the predictions of the model on a held out test set representative of the population. Given the relatively universal behavior of the exponents across architectures and optimizers [11, 18, 19], one might hope that relatively simple models of information processing systems might be able to recover the same types of scaling laws. The (stochastic) gradient descent (SGD) dynamics in random feature models were analyzed in recent works [7, 20, 26] which exhibits a surprising breadth of scaling behavior and captures several interesting phenomena in deep network training. Each of the above works has isolated various effects that can hurt performance compared to the idealized infinite data and infinite model size limits. The model was first studied in [7], where the bottlenecks due to finite width and finite dataset size were computed and, for certain data structure, resulted in a Chinchilla-type scaling result as in [18].
Deep Weight Factorization: Sparse Learning Through the Lens of Artificial Symmetries
Kolb, Chris, Weber, Tobias, Bischl, Bernd, Rügamer, David
Sparse regularization techniques are well-established in machine learning, yet their application in neural networks remains challenging due to the non-differentiability of penalties like the $L_1$ norm, which is incompatible with stochastic gradient descent. A promising alternative is shallow weight factorization, where weights are decomposed into two factors, allowing for smooth optimization of $L_1$-penalized neural networks by adding differentiable $L_2$ regularization to the factors. In this work, we introduce deep weight factorization, extending previous shallow approaches to more than two factors. We theoretically establish equivalence of our deep factorization with non-convex sparse regularization and analyze its impact on training dynamics and optimization. Due to the limitations posed by standard training practices, we propose a tailored initialization scheme and identify important learning rate requirements necessary for training factorized networks. We demonstrate the effectiveness of our deep weight factorization through experiments on various architectures and datasets, consistently outperforming its shallow counterpart and widely used pruning methods.