Gradient Descent
Stochastic Variance-Reduced Iterative Hard Thresholding in Graph Sparsity Optimization
Fox, Derek, Hernandez, Samuel, Tong, Qianqian
Stochastic optimization algorithms are widely used for large-scale data analysis due to their low per-iteration costs, but they often suffer from slow asymptotic convergence caused by inherent variance. Variance-reduced techniques have therefore been used to address this issue in structured sparse models utilizing sparsity-inducing norms or $\ell_0$-norms. However, these techniques are not directly applicable to complex (non-convex) graph sparsity models, which are essential in applications like disease outbreak monitoring and social network analysis. In this paper, we introduce two stochastic variance-reduced gradient-based methods to solve graph sparsity optimization: GraphSVRG-IHT and GraphSCSG-IHT. We provide a general framework for theoretical analysis, demonstrating that our methods enjoy a linear convergence speed. Extensive experiments validate
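As a rough illustration of how variance reduction combines with hard thresholding, the sketch below runs an SVRG-style loop on a least-squares loss and projects each iterate onto s-sparse vectors. It is not the paper's method: the graph-structured projection used by GraphSVRG-IHT is replaced here by plain top-s magnitude thresholding, and the names `hard_threshold`, `grad_i`, and `svrg_iht`, the least-squares loss, the step size, and the epoch counts are all assumptions chosen for illustration.

```python
import random


def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero out the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:s])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def grad_i(w, a, b):
    """Gradient of the single-sample loss 0.5 * (a.w - b)^2 with respect to w."""
    residual = sum(ai * wi for ai, wi in zip(a, w)) - b
    return [residual * ai for ai in a]


def svrg_iht(A, y, s, step=0.02, epochs=30, inner=None, seed=0):
    """SVRG-style updates followed by top-s hard thresholding (illustrative only)."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    inner = inner or 2 * n
    w_snap = [0.0] * d
    for _ in range(epochs):
        # Full gradient at the snapshot point, computed once per epoch.
        full = [0.0] * d
        for a, b in zip(A, y):
            g = grad_i(w_snap, a, b)
            full = [f + gi / n for f, gi in zip(full, g)]
        w = list(w_snap)
        for _ in range(inner):
            i = rng.randrange(n)
            g_w = grad_i(w, A[i], y[i])
            g_s = grad_i(w_snap, A[i], y[i])
            # Variance-reduced direction: stochastic gradient corrected by the snapshot.
            v = [gw - gs + f for gw, gs, f in zip(g_w, g_s, full)]
            # Gradient step, then projection onto s-sparse vectors.
            w = hard_threshold([wi - step * vi for wi, vi in zip(w, v)], s)
        w_snap = w
    return w_snap
```

Calling `svrg_iht(A, y, s)` with `A` as a list of feature rows and `y` as a list of targets returns a vector with at most `s` nonzero entries. A graph-sparsity variant would swap `hard_threshold` for a projection onto the allowed graph-structured supports.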
Learning Non-Vacuous Generalization Bounds from Optimization
Tan, Chengli, Zhang, Jiangshe, Liu, Junmin
Deep neural networks (DNNs) have shown remarkable performance in a wide range of tasks over the past decade (Bengio et al. 2021). A mystery is that they generalize surprisingly well on unseen data, despite having far more trainable parameters than the number of training examples (Belkin et al. 2019, Li et al. 2023). This phenomenon of benign overfitting inevitably casts shadows on the classical theory of statistical learning, which posits that models with high complexity tend to overfit the training data, whereas models with low complexity tend to underfit the training data. To reconcile this conflict, some researchers argue that this is due to the regularization incurred during training, either implicitly imposed via use of stochastic gradient descent (SGD) (Advani et al. 2020, Barrett & Dherin 2021, Smith et al. 2021, Sclocchi & Wyart 2024) or explicitly via batch normalization (Ioffe & Szegedy 2015), weight decay (Krogh & Hertz 1992), dropout (Srivastava et al. 2014), etc. However, Zhang et al. (2017) questioned this widely accepted wisdom because they found that DNNs are still able to achieve zero training error on randomly labeled examples, a fit that clearly cannot generalize. Prior to our work, extensive studies have tried to explain the generalization behavior of DNNs, and they can be roughly categorized into the following classes. The first class consists of the so-called norm-based bounds (Neyshabur et al. 2015, Bartlett et al. 2017, Neyshabur et al. 2018, Golowich et al. 2018), which are built from the operator norms of layerwise weight matrices. However, recent studies suggest that these norm-based bounds might be problematic as they abnormally increase with the number of training examples (Nagarajan & Kolter 2019). Moreover, norm-based bounds are numerically vacuous as they are several orders of magnitude larger than the number of network parameters.
Multiple importance sampling for stochastic gradient estimation
Salaün, Corentin, Huang, Xingchang, Georgiev, Iliyan, Mitra, Niloy J., Singh, Gurprit
We introduce a theoretical and practical framework for efficient importance sampling of mini-batch samples for gradient estimation from single and multiple probability distributions. To handle noisy gradients, our framework dynamically evolves the importance distribution during training by utilizing a self-adaptive metric. Our framework combines multiple, diverse sampling distributions, each tailored to specific parameter gradients. This approach facilitates the importance sampling of vector-valued gradient estimation. Rather than naively combining multiple distributions, our framework involves optimally weighting data contribution across multiple distributions. This adapted combination of multiple importance yields superior gradient estimates, leading to faster training convergence. We demonstrate the effectiveness of our approach through empirical evaluations across a range of optimization tasks like classification and regression on both image and point cloud datasets.
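The basic building block behind this kind of framework can be sketched as a single-distribution importance-sampled gradient estimator: draw sample i with probability p_i and reweight its gradient by 1 / (n p_i) so that the estimate of the mean gradient stays unbiased. The sketch below is an assumption-laden illustration, not the paper's multiple-importance-sampling scheme; the function names and the choice of probabilities proportional to per-sample gradient norms are hypothetical.

```python
import random


def norm_proportional_probs(per_sample_grads, eps=1e-12):
    """Sampling probabilities proportional to each sample's gradient norm."""
    norms = [sum(g * g for g in grad) ** 0.5 + eps for grad in per_sample_grads]
    total = sum(norms)
    return [v / total for v in norms]


def importance_sampled_gradient(per_sample_grads, probs, batch_size, rng):
    """Unbiased estimate of the mean gradient from an importance-sampled mini-batch.

    probs must be normalized (sum to 1) for the 1 / (n * p_i) weights to be correct.
    """
    n = len(per_sample_grads)
    d = len(per_sample_grads[0])
    idx = rng.choices(range(n), weights=probs, k=batch_size)
    estimate = [0.0] * d
    for i in idx:
        weight = 1.0 / (n * probs[i] * batch_size)
        for k in range(d):
            estimate[k] += weight * per_sample_grads[i][k]
    return estimate
```

When every per-sample gradient is a positive scalar and the probabilities are proportional to its magnitude, each reweighted term equals the mean gradient, so the estimator has essentially zero variance. This is the idealized case that adaptive importance distributions try to approximate.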
Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing
Perera, David, Letzelter, Victor, Mariotte, Théo, Cortés, Adrien, Chen, Mickael, Essid, Slim, Richard, Gaël
We introduce Annealed Multiple Choice Learning (aMCL) which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm by extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.
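One way to picture the difference between hard Winner-takes-all and an annealed variant is to replace the hard minimum over hypotheses with Boltzmann (softmin) assignments at a temperature that is lowered during training. The sketch below uses scalar hypotheses and a squared error; the function names and this particular soft-assignment form are assumptions for illustration and may differ from the exact aMCL objective.

```python
import math


def wta_loss(hypotheses, target):
    """Hard Winner-takes-all: only the closest hypothesis is penalized."""
    return min((h - target) ** 2 for h in hypotheses)


def soft_assignments(losses, temperature):
    """Boltzmann weights over hypotheses; lower loss receives higher weight."""
    m = min(losses)  # subtract the minimum for numerical stability
    weights = [math.exp(-(l - m) / temperature) for l in losses]
    z = sum(weights)
    return [w / z for w in weights]


def annealed_wta_loss(hypotheses, target, temperature):
    """Soft-assignment loss that approaches the WTA loss as temperature goes to 0."""
    losses = [(h - target) ** 2 for h in hypotheses]
    q = soft_assignments(losses, temperature)
    return sum(qk * lk for qk, lk in zip(q, losses))
```

At high temperature every hypothesis receives nearly equal weight, so all of them are pulled toward the data; as the temperature decreases, the assignments sharpen and the loss approaches `wta_loss`.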
An Ad-hoc graph node vector embedding algorithm for general knowledge graphs using Kinetica-Graph
Karamete, B. Kaan, Glaser, Eli
This paper discusses how to generate general graph node embeddings from knowledge graph representations. The embedded space is composed of a number of sub-features to mimic both local affinity and remote structural relevance. These sub-feature dimensions are defined by several indicators that we expect to capture nodal similarities, such as hop-based topological patterns, the number of overlapping labels, the transition probabilities (Markov-chain probabilities), and the cluster indices computed by our recursive spectral bisection (RSB) algorithm. These measures are flattened over the one-dimensional vector space into their respective sub-component ranges such that the entire set of vector similarity functions can be used for finding similar nodes. Our novel loss function defines the error as the sum of pairwise squared differences, across a randomly selected sample of graph nodes, between the assumed embeddings and the ground truth estimates. The ground truth is estimated to be a combination of pairwise Jaccard similarity and the number of overlapping labels. Finally, we demonstrate a multi-variate stochastic gradient descent (SGD) algorithm to compute the weighting factors among sub-vector spaces to minimize the average error using a random sampling logic.
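The two ingredients named in the loss description, Jaccard similarity and a sum of pairwise squared differences over sampled node pairs, can be sketched as follows. How the paper combines Jaccard similarity with label overlap, and how similarities are read off the embeddings, is not specified here, so `predicted_sim` and `target_sim` are hypothetical callables supplied by the caller.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_squared_error(predicted_sim, target_sim, pairs):
    """Sum of squared differences between predicted and target similarity over node pairs."""
    return sum((predicted_sim(u, v) - target_sim(u, v)) ** 2 for u, v in pairs)
```

For example, with neighbor sets stored in a dictionary `nbrs`, the target could be `lambda u, v: jaccard(nbrs[u], nbrs[v])`, and `pairs` could be a random sample of node pairs redrawn at each SGD step.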
Regression under demographic parity constraints via unlabeled post-processing
Chzhen, Evgenii, Hebiri, Mohamed, Taturyan, Gayane
We address the problem of performing regression while ensuring demographic parity, even without access to sensitive attributes during inference. We present a general-purpose post-processing algorithm that, using accurate estimates of the regression function and a sensitive attribute predictor, generates predictions that meet the demographic parity constraint. Our method involves discretization and stochastic minimization of a smooth convex function. It is suitable for online post-processing and multi-class classification tasks only involving unlabeled data for the post-processing. Unlike prior methods, our approach is fully theory-driven. We require precise control over the gradient norm of the convex function, and thus, we rely on more advanced techniques than standard stochastic gradient descent. Our algorithm is backed by finite-sample analysis and post-processing bounds, with experimental results validating our theoretical findings.
SOREL: A Stochastic Algorithm for Spectral Risks Minimization
The spectral risk has wide applications in machine learning, especially in real-world decision-making, where people are not only concerned with models' average performance. By assigning different weights to the losses of different sample points, rather than the same weights as in the empirical risk, it allows the model's performance to lie between the average performance and the worst-case performance. In this paper, we propose SOREL, the first stochastic gradient-based algorithm with convergence guarantees for spectral risk minimization. Previous algorithms often consider adding a strongly concave function to smooth the spectral risk, thus lacking convergence guarantees for the original spectral risk. We theoretically prove that our algorithm achieves a near-optimal rate of $\widetilde{O}(1/\sqrt{\epsilon})$ in terms of $\epsilon$. Experiments on real datasets show that our algorithm outperforms existing algorithms in most cases, both in terms of runtime and sample complexity.
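The idea of interpolating between average and worst-case performance can be made concrete with a small sketch. It assumes the common definition of a spectral risk as a weighted sum of the sorted losses, with nonnegative, nondecreasing weights that sum to one. The helper names and the CVaR-style weights are illustrative assumptions, and the sketch only evaluates the risk; it does not implement SOREL.

```python
def spectral_risk(losses, sigma, tol=1e-9):
    """Weighted sum of losses sorted in ascending order; larger losses get larger weights."""
    if len(sigma) != len(losses):
        raise ValueError("sigma must have one weight per loss")
    if any(s < 0 for s in sigma) or abs(sum(sigma) - 1.0) > tol:
        raise ValueError("sigma must be nonnegative and sum to 1")
    if any(sigma[i] > sigma[i + 1] + tol for i in range(len(sigma) - 1)):
        raise ValueError("sigma must be nondecreasing")
    return sum(s * l for s, l in zip(sigma, sorted(losses)))


def uniform_weights(n):
    """Equal weights: the spectral risk reduces to the empirical (average) risk."""
    return [1.0 / n] * n


def cvar_weights(n, k):
    """All weight on the k largest losses; k = 1 gives the worst-case loss."""
    return [0.0] * (n - k) + [1.0 / k] * k
```

With `uniform_weights` the function returns the mean loss, with `cvar_weights(n, 1)` it returns the maximum loss, and intermediate choices of `k` fall between the two.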
A Mirror Descent-Based Algorithm for Corruption-Tolerant Distributed Gradient Descent
Wang, Shuche, Tan, Vincent Y. F.
Distributed gradient descent algorithms have come to the fore in modern machine learning, especially in parallelizing the handling of large datasets that are distributed across several workers. However, scant attention has been paid to analyzing the behavior of distributed gradient descent algorithms in the presence of adversarial corruptions instead of random noise. In this paper, we formulate a novel problem in which adversarial corruptions are present in a distributed learning system. We show how to use ideas from (lazy) mirror descent to design a corruption-tolerant distributed optimization algorithm. Extensive convergence analysis for (strongly) convex loss functions is provided for different choices of the stepsize. We carefully optimize the stepsize schedule to accelerate the convergence of the algorithm, while at the same time amortizing the effect of the corruption over time. Experiments based on linear regression, support vector classification, and softmax classification on the MNIST dataset corroborate our theoretical findings.
Quantum Natural Stochastic Pairwise Coordinate Descent
Sohail, Mohammad Aamir, Khoozani, Mohsen Heidari, Pradhan, S. Sandeep
Quantum machine learning through variational quantum algorithms (VQAs) has gained substantial attention in recent years. VQAs employ parameterized quantum circuits, which are typically optimized using gradient-based methods. However, these methods often exhibit sub-optimal convergence performance due to their dependence on Euclidean geometry. The quantum natural gradient descent (QNGD) optimization method, which considers the geometry of the quantum state space via a quantum information (Riemannian) metric tensor, provides a more effective optimization strategy. Despite its advantages, QNGD encounters notable challenges for learning from quantum data, including the no-cloning principle, which prohibits the replication of quantum data, state collapse, and the measurement postulate, which leads to the stochastic loss function. This paper introduces the quantum natural stochastic pairwise coordinate descent (2-QNSCD) optimization method. This method leverages the curved geometry of the quantum state space through a novel ensemble-based quantum information metric tensor, offering a more physically realizable optimization strategy for learning from quantum data. To improve computational efficiency and reduce sample complexity, we develop a highly sparse unbiased estimator of the novel metric tensor using a quantum circuit with gate complexity $\Theta(1)$ times that of the parameterized quantum circuit and single-shot quantum measurements. Our approach avoids the need for multiple copies of quantum data, thus adhering to the no-cloning principle. We provide a detailed theoretical foundation for our optimization method, along with an exponential convergence analysis. Additionally, we validate the utility of our method through a series of numerical experiments.
Occam Gradient Descent
Deep learning neural network models must be large enough to adapt to their problem domain, while small enough to avoid overfitting training data during gradient descent. To balance these competing demands, overprovisioned deep learning models such as transformers are trained for a single epoch on large data sets, and are hence inefficient with both computing resources and training data. In response to these inefficiencies, we exploit learning theory to derive Occam Gradient Descent, an algorithm that interleaves adaptive reduction of model size to minimize generalization error, with gradient descent on model weights to minimize fitting error. In contrast, traditional gradient descent greedily minimizes fitting error without regard to generalization error. Our algorithm simultaneously descends the space of weights and topological size of any neural network without modification, and in our image classification experiments it outperforms traditional gradient descent, with or without post-train pruning, in loss, compute and model size. Furthermore, applying our algorithm to tabular data classification, we find that across a range of data sets, neural networks trained with Occam Gradient Descent outperform neural networks trained with gradient descent, as well as Random Forests, in both loss and model size.