AITopics

2502.17725

Country:

Europe (0.14)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
North America > United States > Virginia > Loudoun County > Leesburg (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Jones-McCormick, Taj, Jagannath, Aukosh, Sen, Subhabrata

Provable Benefits of Unsupervised Pre-training and Transfer Learning via Single-Index Models

arXiv.org Machine LearningFeb-24-2025

Unsupervised pre-training and transfer learning are commonly used techniques to initialize training algorithms for neural networks, particularly in settings with limited labeled data. In this paper, we study the effects of unsupervised pre-training and transfer learning on the sample complexity of high-dimensional supervised learning. Specifically, we consider the problem of training a single-layer neural network via online stochastic gradient descent. We establish that pre-training and transfer learning (under concept shift) reduce sample complexity by polynomial factors (in the dimension) under very general assumptions. We also uncover some surprising settings where pre-training grants exponential improvement over random initialization in terms of sample complexity.

initialization, sample complexity, theorem 3, (14 more...)

2502.16849

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.05)
North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)
North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Arous, Gerard Ben, Gheissari, Reza, Huang, Jiaoyang, Jagannath, Aukosh

Local geometry of high-dimensional mixture models: Effective spectral theory and dynamical transitions

arXiv.org Machine LearningFeb-21-2025

We study the local geometry of empirical risks in high dimensions via the spectral theory of their Hessian and information matrices. We focus on settings where the data, $(Y_\ell)_{\ell =1}^n\in \mathbb R^d$, are i.i.d. draws of a $k$-component Gaussian mixture model, and the loss depends on the projection of the data into a fixed number of vectors, namely $\mathbf{x}^\top Y$, where $\mathbf{x}\in \mathbb{R}^{d\times C}$ are the parameters, and $C$ need not equal $k$. This setting captures a broad class of problems such as classification by one and two-layer networks and regression on multi-index models. We prove exact formulas for the limits of the empirical spectral distribution and outlier eigenvalues and eigenvectors of such matrices in the proportional asymptotics limit, where the number of samples and dimension $n,d\to\infty$ and $n/d=\phi \in (0,\infty)$. These limits depend on the parameters $\mathbf{x}$ only through the summary statistic of the $(C+k)\times (C+k)$ Gram matrix of the parameters and class means, $\mathbf{G} = (\mathbf{x},\mathbf{\mu})^\top(\mathbf{x},\mathbf{\mu})$. It is known that under general conditions, when $\mathbf{x}$ is trained by stochastic gradient descent, the evolution of these same summary statistics along training converges to the solution of an autonomous system of ODEs, called the effective dynamics. This enables us to connect the spectral theory to the training dynamics. We demonstrate our general results by analyzing the effective spectrum along the effective dynamics in the case of multi-class logistic regression. In this setting, the empirical Hessian and information matrices have substantially different spectra, each with their own static and even dynamical spectral transitions.

matrix, outlier, theorem 1, (16 more...)

2502.15655

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
North America > United States > New York (0.04)
North America > Canada > Ontario (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

arXiv.org Artificial IntelligenceFeb-21-2025

Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm

Cichocki, Andrzej

In this paper we propose and investigate a new class of Generalized Exponentiated Gradient (GEG) algorithms using Mirror Descent (MD) approaches, and applying as a regularization function the Bregman divergence with two-parameter deformation of logarithm as a link function. This link function (referred to as the Euler logarithm) is associated with a wide class of generalized entropies. In order to derive novel GEG/MD updates, we estimate generalized exponential function, which closely approximates the inverse of the Euler two-parameter logarithm. The characteristic/shape and properties of the Euler logarithm and its inverse -- deformed exponential functions are tuned by two or even more hyperparameters. By learning these hyperparameters, we can adapt to distribution of training data, and we can adjust them to achieve desired properties of gradient descent algorithms. The concept of generalized entropies and associated deformed logarithms provide deeper insight into novel gradient descent updates. In literature, there exist nowadays over fifty mathematically well-defined entropic functionals and associated deformed logarithms, so impossible to investigate all of them in one research paper. Therefore, we focus here on a wide-class of trace-form entropies and associated generalized logarithm. We applied the developed algorithms for Online Portfolio Selection (OPLS) in order to improve its performance and robustness.

algorithm, application, logarithm, (14 more...)

2502.175

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
(5 more...)

Genre: Research Report (0.40)

Industry: Banking & Finance > Trading (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Heidari, Marzi, Guo, Yuhong

Single Domain Generalization with Model-aware Parametric Batch-wise Mixup

arXiv.org Artificial IntelligenceFeb-21-2025

Single Domain Generalization (SDG) remains a formidable challenge in the field of machine learning, particularly when models are deployed in environments that differ significantly from their training domains. In this paper, we propose a novel data augmentation approach, named as Model-aware Parametric Batch-wise Mixup (MPBM), to tackle the challenge of SDG. MPBM deploys adversarial queries generated with stochastic gradient Langevin dynamics, and produces model-aware augmenting instances with a parametric batch-wise mixup generator network that is carefully designed through an innovative attention mechanism. By exploiting inter-feature correlations, the parameterized mixup generator introduces additional versatility in combining features across a batch of instances, thereby enhancing the capacity to generate highly adaptive and informative synthetic instances for specific queries. The synthetic data produced by this adaptable generator network, guided by informative queries, is expected to significantly enrich the representation space covered by the original training dataset and subsequently enhance the prediction model's generalizability across diverse and previously unseen domains. To prevent excessive deviation from the training data, we further incorporate a real-data alignment-based adversarial loss into the learning process of MPBM, regularizing any tendencies toward undesirable expansions. We conduct extensive experiments on several benchmark datasets. The empirical results demonstrate that by augmenting the training set with informative synthesis data, our proposed MPBM method achieves the state-of-the-art performance for single domain generalization.

generalization, international conference, prediction model, (11 more...)

2502.16064

Country: North America > Canada > Ontario > National Capital Region > Ottawa (0.14)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

arXiv.org Machine LearningFeb-21-2025

Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks

Cai, Yuhang, Zhou, Kangjie, Wu, Jingfeng, Mei, Song, Lindsey, Michael, Bartlett, Peter L.

Deep networks often have an enormous amount of parameters an d are theoretically capable of overfitting the training data. However, in practice, deep networks trai ned via gradient descent (GD) or its variants often generalize well. This is commonly attributed to the implicit bias of GD, in which GD finds a certain solution that prevents overfitting ( Zhang et al., 2021; Neyshabur et al., 2017; Bartlett et al., 2021). Understanding the implicit bias of GD is one of the central topics in deep learni ng theory. The implicit bias of GD is relatively well-understood when t he network is homogeneous (see Soudry et al., 2018; Ji and Telgarsky, 2018; Lyu and Li, 2020; Ji and Telgarsky, 2020; Wu et al., 2023, and references therein). For linear networks trained on linearly separabl e data, GD diverges in norm while converging in direction to the maximum margin solution ( Soudry et al., 2018; Ji and Telgarsky, 2018; Wu et al., 2023). Similar results have been established for generic homogene ous networks that include a class of deep networks, assuming that the network at initialization can sepa rate the training data.

assumption 1, ji and telgarsky, nullxnull, (15 more...)

2502.16075

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)

Medyakov, Daniil, Molodtsov, Gleb, Chezhegov, Savelii, Rebrikov, Alexey, Beznosikov, Aleksandr

Variance Reduction Methods Do Not Need to Compute Full Gradients: Improved Efficiency through Shuffling

arXiv.org Artificial IntelligenceFeb-20-2025

In today's world, machine learning is hard to imagine without large training datasets and models. This has led to the use of stochastic methods for training, such as stochastic gradient descent (SGD). SGD provides weak theoretical guarantees of convergence, but there are modifications, such as Stochastic Variance Reduced Gradient (SVRG) and StochAstic Recursive grAdient algoritHm (SARAH), that can reduce the variance. These methods require the computation of the full gradient occasionally, which can be time consuming. In this paper, we explore variants of variance reduction algorithms that eliminate the need for full gradient computations. To make our approach memory-efficient and avoid full gradient computations, we use two key techniques: the shuffling heuristic and idea of SAG/SAGA methods. As a result, we improve existing estimates for variance reduction algorithms without the full gradient computations. Additionally, for the non-convex objective function, our estimate matches that of classic shuffling methods, while for the strongly convex one, it is an improvement. We conduct comprehensive theoretical analysis and provide extensive experimental results to validate the efficiency and practicality of our methods for large-scale machine learning problems.

algorithm, assumption 1, gradient, (15 more...)

2502.14648

Country: Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)

arXiv.org Machine LearningFeb-19-2025

Weighted Low-rank Approximation via Stochastic Gradient Descent on Manifolds

Xu, Conglong, Yang, Peiqi, Wu, Hao

We solve a regularized weighted low-rank approximation problem by a stochastic gradient descent on a manifold. To guarantee the convergence of our stochastic gradient descent, we establish a convergence theorem on manifolds for retraction-based stochastic gradient descents admitting confinements. On sample data from the Netflix Prize training dataset, our algorithm outperforms the existing stochastic gradient descent on Euclidean spaces. We also compare the accelerated line search on this manifold to the existing accelerated line search on Euclidean spaces.

equation, problem 1, sequence, (13 more...)

2502.14174

Country:

North America > United States > District of Columbia > Washington (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.40)

Industry:

Media > Film (0.34)
Leisure & Entertainment (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Dandi, Yatin, Pesce, Luca, Zdeborová, Lenka, Krzakala, Florent

The Computational Advantage of Depth: Learning High-Dimensional Hierarchical Functions with Gradient Descent

arXiv.org Machine LearningFeb-19-2025

Understanding the advantages of deep neural networks trained by gradient descent (GD) compared to shallow models remains an open theoretical challenge. While the study of multi-index models with Gaussian data in high dimensions has provided analytical insights into the benefits of GD-trained neural networks over kernels, the role of depth in improving sample complexity and generalization in GD-trained networks remains poorly understood. In this paper, we introduce a class of target functions (single and multi-index Gaussian hierarchical targets) that incorporate a hierarchy of latent subspace dimensionalities. This framework enables us to analytically study the learning dynamics and generalization performance of deep networks compared to shallow ones in the high-dimensional limit. Specifically, our main theorem shows that feature learning with GD reduces the effective dimensionality, transforming a high-dimensional problem into a sequence of lower-dimensional ones. This enables learning the target function with drastically less samples than with shallow networks. While the results are proven in a controlled training setting, we also discuss more common training procedures and argue that they learn through the same mechanisms. These findings open the way to further quantitative studies of the crucial role of depth in learning hierarchical structures with deep networks.

arxiv preprint arxiv, denote, neural network, (14 more...)

2502.13961

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

arXiv.org Machine LearningFeb-18-2025

Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression

Wu, Jingfeng, Bartlett, Peter, Telgarsky, Matus, Yu, Bin

In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution -- a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk vanishes for early-stopped GD but diverges to infinity for GD iterates at convergence. This suggests that early-stopped GD is well-calibrated, whereas asymptotic GD is statistically inconsistent. Second, we show that to attain a small excess zero-one risk, polynomially many samples are sufficient for early-stopped GD, while exponentially many samples are necessary for any interpolating estimator, including asymptotic GD. This separation underscores the statistical benefits of early stopping in the overparameterized regime. Finally, we establish nonasymptotic bounds on the norm and angular differences between early-stopped GD and $\ell_2$-regularized empirical risk minimizer, thereby connecting the implicit regularization of GD with explicit $\ell_2$-regularization.

exp, logistic regression, regression, (15 more...)

2502.13283

Country: North America > United States > California > Alameda County > Berkeley (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)