Goto

Collaborating Authors

 Gradient Descent


Solving the Traveling Salesman Problem via Different Quantum Computing Architectures

arXiv.org Artificial Intelligence

We study the application of emerging photonic and quantum computing architectures to solving the Traveling Salesman Problem (TSP), a well-known NP-hard optimization problem. We investigate several approaches: Simulated Annealing (SA), Quadratic Unconstrained Binary Optimization (QUBO-Ising) methods implemented on quantum annealers and Optical Coherent Ising Machines, as well as the Quantum Approximate Optimization Algorithm (QAOA) and the Quantum Phase Estimation (QPE) algorithm on gate-based quantum computers. QAOA and QPE were tested on the IBM Quantum platform. The QUBO-Ising method was explored using the D-Wave quantum annealer, which operates on superconducting Josephson junctions, and the QCI Dirac machine, a nonlinear optoelectronic Ising machine. Gate-based quantum computers demonstrated accurate results for small TSP instances in simulation. However, real quantum devices are hindered by noise and limited scalability. Circuit complexity grows with problem size, restricting performance to TSP instances with a maximum of 6 nodes. In contrast, Ising-based architectures show improved scalability for larger problem sizes. SQUID-based Ising machines can handle TSP instances with up to 12 nodes, while nonlinear optoelectronic Ising machines extend this capability to 18 nodes. Nevertheless, the solutions tend to be suboptimal due to hardware limitations and challenges in achieving ground state convergence as the problem size increases. Despite these limitations, Ising machines demonstrate significant time advantages over classical methods, making them a promising candidate for solving larger-scale TSPs efficiently.


Provable Benefits of Unsupervised Pre-training and Transfer Learning via Single-Index Models

arXiv.org Machine Learning

Unsupervised pre-training and transfer learning are commonly used techniques to initialize training algorithms for neural networks, particularly in settings with limited labeled data. In this paper, we study the effects of unsupervised pre-training and transfer learning on the sample complexity of high-dimensional supervised learning. Specifically, we consider the problem of training a single-layer neural network via online stochastic gradient descent. We establish that pre-training and transfer learning (under concept shift) reduce sample complexity by polynomial factors (in the dimension) under very general assumptions. We also uncover some surprising settings where pre-training grants exponential improvement over random initialization in terms of sample complexity.


Local geometry of high-dimensional mixture models: Effective spectral theory and dynamical transitions

arXiv.org Machine Learning

We study the local geometry of empirical risks in high dimensions via the spectral theory of their Hessian and information matrices. We focus on settings where the data, $(Y_\ell)_{\ell =1}^n\in \mathbb R^d$, are i.i.d. draws of a $k$-component Gaussian mixture model, and the loss depends on the projection of the data into a fixed number of vectors, namely $\mathbf{x}^\top Y$, where $\mathbf{x}\in \mathbb{R}^{d\times C}$ are the parameters, and $C$ need not equal $k$. This setting captures a broad class of problems such as classification by one and two-layer networks and regression on multi-index models. We prove exact formulas for the limits of the empirical spectral distribution and outlier eigenvalues and eigenvectors of such matrices in the proportional asymptotics limit, where the number of samples and dimension $n,d\to\infty$ and $n/d=\phi \in (0,\infty)$. These limits depend on the parameters $\mathbf{x}$ only through the summary statistic of the $(C+k)\times (C+k)$ Gram matrix of the parameters and class means, $\mathbf{G} = (\mathbf{x},\mathbf{\mu})^\top(\mathbf{x},\mathbf{\mu})$. It is known that under general conditions, when $\mathbf{x}$ is trained by stochastic gradient descent, the evolution of these same summary statistics along training converges to the solution of an autonomous system of ODEs, called the effective dynamics. This enables us to connect the spectral theory to the training dynamics. We demonstrate our general results by analyzing the effective spectrum along the effective dynamics in the case of multi-class logistic regression. In this setting, the empirical Hessian and information matrices have substantially different spectra, each with their own static and even dynamical spectral transitions.


Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm

arXiv.org Artificial Intelligence

In this paper we propose and investigate a new class of Generalized Exponentiated Gradient (GEG) algorithms using Mirror Descent (MD) approaches, and applying as a regularization function the Bregman divergence with two-parameter deformation of logarithm as a link function. This link function (referred to as the Euler logarithm) is associated with a wide class of generalized entropies. In order to derive novel GEG/MD updates, we estimate generalized exponential function, which closely approximates the inverse of the Euler two-parameter logarithm. The characteristic/shape and properties of the Euler logarithm and its inverse -- deformed exponential functions are tuned by two or even more hyperparameters. By learning these hyperparameters, we can adapt to distribution of training data, and we can adjust them to achieve desired properties of gradient descent algorithms. The concept of generalized entropies and associated deformed logarithms provide deeper insight into novel gradient descent updates. In literature, there exist nowadays over fifty mathematically well-defined entropic functionals and associated deformed logarithms, so impossible to investigate all of them in one research paper. Therefore, we focus here on a wide-class of trace-form entropies and associated generalized logarithm. We applied the developed algorithms for Online Portfolio Selection (OPLS) in order to improve its performance and robustness.


Single Domain Generalization with Model-aware Parametric Batch-wise Mixup

arXiv.org Artificial Intelligence

Single Domain Generalization (SDG) remains a formidable challenge in the field of machine learning, particularly when models are deployed in environments that differ significantly from their training domains. In this paper, we propose a novel data augmentation approach, named as Model-aware Parametric Batch-wise Mixup (MPBM), to tackle the challenge of SDG. MPBM deploys adversarial queries generated with stochastic gradient Langevin dynamics, and produces model-aware augmenting instances with a parametric batch-wise mixup generator network that is carefully designed through an innovative attention mechanism. By exploiting inter-feature correlations, the parameterized mixup generator introduces additional versatility in combining features across a batch of instances, thereby enhancing the capacity to generate highly adaptive and informative synthetic instances for specific queries. The synthetic data produced by this adaptable generator network, guided by informative queries, is expected to significantly enrich the representation space covered by the original training dataset and subsequently enhance the prediction model's generalizability across diverse and previously unseen domains. To prevent excessive deviation from the training data, we further incorporate a real-data alignment-based adversarial loss into the learning process of MPBM, regularizing any tendencies toward undesirable expansions. We conduct extensive experiments on several benchmark datasets. The empirical results demonstrate that by augmenting the training set with informative synthesis data, our proposed MPBM method achieves the state-of-the-art performance for single domain generalization.


Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks

arXiv.org Machine Learning

Deep networks often have an enormous amount of parameters an d are theoretically capable of overfitting the training data. However, in practice, deep networks trai ned via gradient descent (GD) or its variants often generalize well. This is commonly attributed to the implicit bias of GD, in which GD finds a certain solution that prevents overfitting ( Zhang et al., 2021; Neyshabur et al., 2017; Bartlett et al., 2021). Understanding the implicit bias of GD is one of the central topics in deep learni ng theory. The implicit bias of GD is relatively well-understood when t he network is homogeneous (see Soudry et al., 2018; Ji and Telgarsky, 2018; Lyu and Li, 2020; Ji and Telgarsky, 2020; Wu et al., 2023, and references therein). For linear networks trained on linearly separabl e data, GD diverges in norm while converging in direction to the maximum margin solution ( Soudry et al., 2018; Ji and Telgarsky, 2018; Wu et al., 2023). Similar results have been established for generic homogene ous networks that include a class of deep networks, assuming that the network at initialization can sepa rate the training data.


Variance Reduction Methods Do Not Need to Compute Full Gradients: Improved Efficiency through Shuffling

arXiv.org Artificial Intelligence

In today's world, machine learning is hard to imagine without large training datasets and models. This has led to the use of stochastic methods for training, such as stochastic gradient descent (SGD). SGD provides weak theoretical guarantees of convergence, but there are modifications, such as Stochastic Variance Reduced Gradient (SVRG) and StochAstic Recursive grAdient algoritHm (SARAH), that can reduce the variance. These methods require the computation of the full gradient occasionally, which can be time consuming. In this paper, we explore variants of variance reduction algorithms that eliminate the need for full gradient computations. To make our approach memory-efficient and avoid full gradient computations, we use two key techniques: the shuffling heuristic and idea of SAG/SAGA methods. As a result, we improve existing estimates for variance reduction algorithms without the full gradient computations. Additionally, for the non-convex objective function, our estimate matches that of classic shuffling methods, while for the strongly convex one, it is an improvement. We conduct comprehensive theoretical analysis and provide extensive experimental results to validate the efficiency and practicality of our methods for large-scale machine learning problems.


Weighted Low-rank Approximation via Stochastic Gradient Descent on Manifolds

arXiv.org Machine Learning

We solve a regularized weighted low-rank approximation problem by a stochastic gradient descent on a manifold. To guarantee the convergence of our stochastic gradient descent, we establish a convergence theorem on manifolds for retraction-based stochastic gradient descents admitting confinements. On sample data from the Netflix Prize training dataset, our algorithm outperforms the existing stochastic gradient descent on Euclidean spaces. We also compare the accelerated line search on this manifold to the existing accelerated line search on Euclidean spaces.


The Computational Advantage of Depth: Learning High-Dimensional Hierarchical Functions with Gradient Descent

arXiv.org Machine Learning

Understanding the advantages of deep neural networks trained by gradient descent (GD) compared to shallow models remains an open theoretical challenge. While the study of multi-index models with Gaussian data in high dimensions has provided analytical insights into the benefits of GD-trained neural networks over kernels, the role of depth in improving sample complexity and generalization in GD-trained networks remains poorly understood. In this paper, we introduce a class of target functions (single and multi-index Gaussian hierarchical targets) that incorporate a hierarchy of latent subspace dimensionalities. This framework enables us to analytically study the learning dynamics and generalization performance of deep networks compared to shallow ones in the high-dimensional limit. Specifically, our main theorem shows that feature learning with GD reduces the effective dimensionality, transforming a high-dimensional problem into a sequence of lower-dimensional ones. This enables learning the target function with drastically less samples than with shallow networks. While the results are proven in a controlled training setting, we also discuss more common training procedures and argue that they learn through the same mechanisms. These findings open the way to further quantitative studies of the crucial role of depth in learning hierarchical structures with deep networks.


Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression

arXiv.org Machine Learning

In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution -- a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk vanishes for early-stopped GD but diverges to infinity for GD iterates at convergence. This suggests that early-stopped GD is well-calibrated, whereas asymptotic GD is statistically inconsistent. Second, we show that to attain a small excess zero-one risk, polynomially many samples are sufficient for early-stopped GD, while exponentially many samples are necessary for any interpolating estimator, including asymptotic GD. This separation underscores the statistical benefits of early stopping in the overparameterized regime. Finally, we establish nonasymptotic bounds on the norm and angular differences between early-stopped GD and $\ell_2$-regularized empirical risk minimizer, thereby connecting the implicit regularization of GD with explicit $\ell_2$-regularization.