AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Safe Adaptive Importance Sampling

Sebastian U. Stich, Anant Raj, Martin Jaggi

Neural Information Processing SystemsOct-2-2024, 18:55:47 GMT

Importance sampling has become an indispensable strategy to speed up optimization algorithms for large-scale applications. Improved adaptive variants--using importance values defined by the complete gradient information which changes during optimization--enjoy favorable theoretical properties, but are typically computationally infeasible. In this paper we propose an efficient approximation of gradient-based sampling, which is based on safe bounds on the gradient. The proposed sampling distribution is (i) provably the best sampling with respect to the given bounds, (ii) always better than uniform sampling and fixed importance sampling and (iii) can efficiently be computed--in many applications at negligible extra cost. The proposed sampling scheme is generic and can easily be integrated into existing algorithms. In particular, we show that coordinate-descent (CD) and stochastic gradient descent (SGD) can enjoy significant a speed-up under the novel scheme. The proven efficiency of the proposed sampling is verified by extensive numerical testing.

artificial intelligence, machine learning, optimization problem, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > New York (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Add feedback

Deep Hyperalignment

Muhammad Yousefnezhad, Daoqiang Zhang

Neural Information Processing SystemsOct-2-2024, 16:37:34 GMT

This paper proposes Deep Hyperalignment (DHA) as a regularized, deep extension, scalable Hyperalignment (HA) method, which is well-suited for applying functional alignment to fMRI datasets with nonlinearity, high-dimensionality (broad ROI), and a large number of subjects. Unlink previous methods, DHA is not limited by a restricted fixed kernel function. Further, it uses a parametric approach, rank-m Singular Value Decomposition (SVD), and stochastic gradient descent for optimization. Therefore, DHA has a suitable time complexity for large datasets, and DHA does not require the training data when it computes the functional alignment for a new subject. Experimental studies on multi-subject fMRI analysis confirm that the DHA method achieves superior performance to other state-of-the-art HA algorithms.

alignment, dataset, functional alignment, (16 more...)

Neural Information Processing Systems

Country:

North America > Canada (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(4 more...)

Industry:

Health & Medicine > Health Care Technology (0.58)
Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Add feedback

Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization

Fabian Pedregosa, Rémi Leblond, Simon Lacoste-Julien

Neural Information Processing SystemsOct-2-2024, 16:35:22 GMT

Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints.

algorithm, gradient, saga, (14 more...)

Neural Information Processing Systems

Country:

Europe > France > Île-de-France > Paris > Paris (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)

Add feedback

On the SAGA algorithm with decreasing step

Fredes, Luis, Bercu, Bernard, Gbaguidi, Eméric

arXiv.org Machine LearningOct-2-2024

Stochastic optimization naturally appear in many application areas, including machine learning. Our goal is to go further in the analysis of the Stochastic Average Gradient Accelerated (SAGA) algorithm. To achieve this, we introduce a new $\lambda$-SAGA algorithm which interpolates between the Stochastic Gradient Descent ($\lambda=0$) and the SAGA algorithm ($\lambda=1$). Firstly, we investigate the almost sure convergence of this new algorithm with decreasing step which allows us to avoid the restrictive strong convexity and Lipschitz gradient hypotheses associated to the objective function. Secondly, we establish a central limit theorem for the $\lambda$-SAGA algorithm. Finally, we provide the non-asymptotic $\mathbb{L}^p$ rates of convergence.

algorithm, convergence, saga algorithm, (13 more...)

arXiv.org Machine Learning

2410.0376

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > France (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Add feedback

Universality in Transfer Learning for Linear Models

Ghane, Reza, Akhtiamov, Danil, Hassibi, Babak

arXiv.org Machine LearningOct-2-2024

Transfer learning is an attractive framework for problems where there is a paucity of data, or where data collection is costly. One common approach to transfer learning is referred to as "model-based", and involves using a model that is pretrained on samples from a source distribution, which is easier to acquire, and then fine-tuning the model on a few samples from the target distribution. The hope is that, if the source and target distributions are ``close", then the fine-tuned model will perform well on the target distribution even though it has seen only a few samples from it. In this work, we study the problem of transfer learning in linear models for both regression and binary classification. In particular, we consider the use of stochastic gradient descent (SGD) on a linear model initialized with pretrained weights and using a small training data set from the target distribution. In the asymptotic regime of large models, we provide an exact and rigorous analysis and relate the generalization errors (in regression) and classification errors (in binary classification) for the pretrained and fine-tuned models. In particular, we give conditions under which the fine-tuned model outperforms the pretrained one. An important aspect of our work is that all the results are "universal", in the sense that they depend only on the first and second order statistics of the target distribution. They thus extend well beyond the standard Gaussian assumptions commonly made in the literature.

assumption, matrix, universality, (15 more...)

arXiv.org Machine Learning

2410.02164

Country:

North America > United States > California (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Greece (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Add feedback

Review Non-convex Optimization Method for Machine Learning

Fotopoulos, Greg B, Popovich, Paul, Papadopoulos, Nicholas Hall

arXiv.org Artificial IntelligenceOct-2-2024

Non-convex optimization is a critical tool in advancing machine learning, especially for complex models like deep neural networks and support vector machines. Despite challenges such as multiple local minima and saddle points, non-convex techniques offer various pathways to reduce computational costs. These include promoting sparsity through regularization, efficiently escaping saddle points, and employing subsampling and approximation strategies like stochastic gradient descent. Additionally, non-convex methods enable model pruning and compression, which reduce the size of models while maintaining performance. By focusing on good local minima instead of exact global minima, non-convex optimization ensures competitive accuracy with faster convergence and lower computational overhead. This paper examines the key methods and applications of non-convex optimization in machine learning, exploring how it can lower computation costs while enhancing model performance. Furthermore, it outlines future research directions and challenges, including scalability and generalization, that will shape the next phase of non-convex optimization in machine learning.

machine learning, review non-convex optimization method

arXiv.org Artificial Intelligence

2410.02017

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.53)

Add feedback

NTK-DFL: Enhancing Decentralized Federated Learning in Heterogeneous Settings via Neural Tangent Kernel

Thompson, Gabriel, Yue, Kai, Wong, Chau-Wai, Dai, Huaiyu

arXiv.org Artificial IntelligenceOct-2-2024

Decentralized federated learning (DFL) is a collaborative machine learning framework for training a model across participants without a central server or raw data exchange. DFL faces challenges due to statistical heterogeneity, as participants often possess different data distributions reflecting local environments and user behaviors. Recent work has shown that the neural tangent kernel (NTK) approach, when applied to federated learning in a centralized framework, can lead to improved performance. The NTK-based update mechanism is more expressive than typical gradient descent methods, enabling more efficient convergence and better handling of data heterogeneity. We propose an approach leveraging the NTK to train client models in the decentralized setting, while introducing a synergy between NTK-based evolution and model averaging. This synergy exploits inter-model variance and improves both accuracy and convergence in heterogeneous settings. Our model averaging technique significantly enhances performance, boosting accuracy by at least 10% compared to the mean local model accuracy. Empirical results demonstrate that our approach consistently achieves higher accuracy than baselines in highly heterogeneous settings, where other approaches often underperform. Additionally, it reaches target performance in 4.6 times fewer communication rounds. We validate our approach across multiple datasets, network topologies, and heterogeneity settings to ensure robustness and generalizability.

accuracy, heterogeneity, ntk-dfl, (13 more...)

arXiv.org Artificial Intelligence

2410.01922

Country: North America > United States > Virginia (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Extending Contextual Self-Modulation: Meta-Learning Across Modalities, Task Dimensionalities, and Data Regimes

Nzoyem, Roussel Desmond, Barton, David A. W., Deakin, Tom

arXiv.org Artificial IntelligenceOct-2-2024

Contextual Self-Modulation (CSM) is a potent regularization mechanism for the Neural Context Flow (NCF) framework which demonstrates powerful metalearning of physical systems. However, CSM has limitations in its applicability across different modalities and in high-data regimes. In this work, we introduce two extensions: iCSM, which expands CSM to infinite-dimensional tasks, and StochasticNCF, which improves scalability. These extensions are demonstrated through comprehensive experimentation on a range of tasks, including dynamical systems with parameter variations, computer vision challenges, and curve fitting problems. StochasticNCF enables the application of both CSM and iCSM to high-data scenarios by providing an unbiased approximation of meta-gradient updates through a sampled set of nearest environments. Additionally, we incorporate higher-order Taylor expansions via Taylor-Mode automatic differentiation, revealing that higher-order approximations do not necessarily enhance generalization. Finally, we demonstrate how CSM can be integrated into other meta-learning frameworks with FlashCAVIA, a computationally efficient extension of the CAVIA meta-learning framework (Zintgraf et al. 2019). FlashCAVIA outperforms its predecessor across various benchmarks and reinforces the utility of bi-level optimization techniques. Together, these contributions establish a robust framework for tackling an expanded spectrum of meta-learning tasks, offering practical insights for out-of-distribution generalization. Our opensourced library, designed for flexible integration of self-modulation into contextual meta-learning workflows, is available at github.com/ddrous/self-mod. Meta-learning has emerged as a powerful paradigm in machine learning, addressing the limitations of conventional approaches that train a single algorithm for a specific task. This innovative technique aims to develop models capable of rapid adaptation to novel but related tasks with minimal data, a process often referred to as "learning to learn" (Wang et al., 2021). By leveraging common information across multiple training environments (or meta-knowledge), meta-learning algorithms can efficiently adapt to new scenarios without starting from scratch (Hospedales et al., 2021). The success of meta-learning has been demonstrated in various domains, including dynamical system reconstruction (Norcliffe et al., 2021), program induction (Devlin et al., 2017), out-of-distribution (OoD) generalization (Yao et al., 2021), and continual learning (Hurtado et al., 2021).

dynamical system, experiment, ncf, (12 more...)

arXiv.org Artificial Intelligence

2410.01655

Country:

Europe > United Kingdom > England > Bristol (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Add feedback

On the Convergence of FedProx with Extrapolation and Inexact Prox

Li, Hanmin, Richtárik, Peter

arXiv.org Artificial IntelligenceOct-2-2024

Enhancing the FedProx federated learning algorithm (Li et al., 2020) with server-side extrapolation, Li et al. (2024a) recently introduced the FedExProx method. Their theoretical analysis, however, relies on the assumption that each client computes a certain proximal operator exactly, which is impractical since this is virtually never possible to do in real settings. In this paper, we investigate the behavior of FedExProx without this exactness assumption in the smooth and globally strongly convex setting. We establish a general convergence result, showing that inexactness leads to convergence to a neighborhood of the solution. Additionally, we demonstrate that, with careful control, the adverse effects of this inexactness can be mitigated. By linking inexactness to biased compression (Beznosikov et al., 2023), we refine our analysis, highlighting robustness of extrapolation to inexact proximal updates. We also examine the local iteration complexity required by each client to achieved the required level of inexactness using various local optimizers. Our theoretical insights are validated through comprehensive numerical experiments.

algorithm, approximation, fedexprox, (15 more...)

arXiv.org Artificial Intelligence

2410.0141

Country:

Asia > Middle East > Saudi Arabia > Mecca Province > Thuwal (0.04)
North America > United States > Virginia (0.04)
Europe > Russia (0.04)
(3 more...)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.47)

Add feedback

Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks

Chen, Ke, Yi, Chugang, Yang, Haizhao

arXiv.org Machine LearningOct-2-2024

We study the implicit bias towards low-rank weight matrices when training neural networks (NN) with Weight Decay (WD). We prove that when a ReLU NN is sufficiently trained with Stochastic Gradient Descent (SGD) and WD, its weight matrix is approximately a rank-two matrix. Empirically, we demonstrate that WD is a necessary condition for inducing this low-rank bias across both regression and classification tasks. Our work differs from previous studies as our theoretical analysis does not rely on common assumptions regarding the training data distribution, optimality of weight matrices, or specific training procedures. Furthermore, by leveraging the low-rank bias, we derive improved generalization error bounds and provide numerical evidence showing that better generalization can be achieved. Thus, our work offers both theoretical and empirical insights into the strong generalization performance of SGD when combined with WD.

generalization error, low-rank bias, neural network, (13 more...)

arXiv.org Machine Learning

2410.02176

Country:

North America > United States > California (0.05)
North America > United States > Maryland > Prince George's County > College Park (0.04)
North America > United States > Delaware > New Castle County > Newark (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback