AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Comparing Federated Stochastic Gradient Descent and Federated Averaging for Predicting Hospital Length of Stay

arXiv.org Artificial IntelligenceJul-17-2024

Predicting hospital length of stay (LOS) reliably is an essential need for efficient resource allocation at hospitals. Traditional predictive modeling tools frequently have difficulty acquiring sufficient and diverse data because healthcare institutions have privacy rules in place. In our study, we modeled this problem as an empirical graph where nodes are the hospitals. This modeling approach facilitates collaborative model training by modeling decentralized data sources from different hospitals without extracting sensitive data outside of hospitals. A local model is trained on a node (hospital) by aiming the generalized total variation minimization (GTVMin). Moreover, we implemented and compared two different federated learning optimization algorithms named federated stochastic gradient descent (FedSGD) and federated averaging (FedAVG). Our results show that federated learning enables accurate prediction of hospital LOS while addressing privacy concerns without extracting data outside healthcare institutions.

federated stochastic gradient descent, gradient descent and federated averaging, hospital length

arXiv.org Artificial Intelligence

2407.12741

Genre: Research Report > New Finding (0.87)

Industry: Health & Medicine > Health Care Providers & Services (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

A Methodology Establishing Linear Convergence of Adaptive Gradient Methods under PL Inequality

Chakrabarti, Kushal, Baranwal, Mayank

arXiv.org Machine LearningJul-17-2024

Adaptive gradient-descent optimizers are the standard choice for training neural network models. Despite their faster convergence than gradient-descent and remarkable performance in practice, the adaptive optimizers are not as well understood as vanilla gradient-descent. A reason is that the dynamic update of the learning rate that helps in faster convergence of these methods also makes their analysis intricate. Particularly, the simple gradient-descent method converges at a linear rate for a class of optimization problems, whereas the practically faster adaptive gradient methods lack such a theoretical guarantee. The Polyak-{\L}ojasiewicz (PL) inequality is the weakest known class, for which linear convergence of gradient-descent and its momentum variants has been proved. Therefore, in this paper, we prove that AdaGrad and Adam, two well-known adaptive gradient methods, converge linearly when the cost function is smooth and satisfies the PL inequality. Our theoretical framework follows a simple and unified approach, applicable to both batch and stochastic gradients, which can potentially be utilized in analyzing linear convergence of other variants of Adam.

adaptive gradient method, linear convergence, pl inequality

arXiv.org Machine Learning

2407.12629

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

In-Context Probing Approximates Influence Function for Data Valuation

Jiao, Cathy, Gao, Gary, Xiong, Chenyan

arXiv.org Artificial IntelligenceJul-16-2024

Data valuation quantifies the value of training data, and is used for data attribution (i.e., determining the contribution of training data towards model predictions), and data selection; both of which are important for curating high-quality datasets to train large language models. In our paper, we show that data valuation through in-context probing (i.e., prompting a LLM) approximates influence functions for selecting training data. We provide a theoretical sketch on this connection based on transformer models performing "implicit" gradient descent on its in-context inputs. Our empirical findings show that in-context probing and gradient-based influence frameworks are similar in how they rank training data. Furthermore, fine-tuning experiments on data selected by either method reveal similar model performance.

arxiv, infl ip, preprint, (14 more...)

arXiv.org Artificial Intelligence

2407.12259

Country:

North America > Canada > Ontario > Toronto (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

Add feedback

Correlations Are Ruining Your Gradient Descent

Ahmad, Nasir

arXiv.org Artificial IntelligenceJul-15-2024

Herein the topics of (natural) gradient descent, data decorrelation, and approximate methods for backpropagation are brought into a dialogue. Natural gradient descent illuminates how gradient vectors, pointing at directions of steepest descent, can be improved by considering the local curvature of loss landscapes. We extend this perspective and show that to fully solve the problem illuminated by natural gradients in neural networks, one must recognise that correlations in the data at any linear transformation, including node responses at every layer of a neural network, cause a non-orthonormal relationship between the model's parameters. To solve this requires a solution to decorrelate inputs at each individual layer of a neural network. We describe a range of methods which have been proposed for decorrelation and whitening of node output, while providing a novel method specifically useful for distributed computing and computational neuroscience. Implementing decorrelation within multi-layer neural networks, we can show that not only is training via backpropagation sped up significantly but also existing approximations of backpropagation, which have failed catastrophically in the past, are made performant once more. This has the potential to provide a route forward for approximate gradient descent methods which have previously been discarded, training approaches for analogue and neuromorphic hardware, and potentially insights as to the efficacy and utility of decorrelation processes in the brain.

correlation, gradient descent, neural network, (14 more...)

arXiv.org Artificial Intelligence

2407.1078

Country:

Europe > Netherlands > Gelderland > Nijmegen (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

On-Device Training of Fully Quantized Deep Neural Networks on Cortex-M Microcontrollers

Deutel, Mark, Hannig, Frank, Mutschler, Christopher, Teich, Jürgen

arXiv.org Artificial IntelligenceJul-15-2024

On-device training of DNNs allows models to adapt and fine-tune to newly collected data or changing domains while deployed on microcontroller units (MCUs). However, DNN training is a resource-intensive task, making the implementation and execution of DNN training algorithms on MCUs challenging due to low processor speeds, constrained throughput, limited floating-point support, and memory constraints. In this work, we explore on-device training of DNNs for Cortex-M MCUs. We present a method that enables efficient training of DNNs completely in place on the MCU using fully quantized training (FQT) and dynamic partial gradient updates. We demonstrate the feasibility of our approach on multiple vision and time-series datasets and provide insights into the tradeoff between training accuracy, memory overhead, energy, and latency on real hardware.

dataset, dnn, tensor, (17 more...)

arXiv.org Artificial Intelligence

2407.10734

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education > Educational Setting > Online (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)

Add feedback

Diffusion Tempering Improves Parameter Estimation with Probabilistic Integrators for Ordinary Differential Equations

Beck, Jonas, Bosch, Nathanael, Deistler, Michael, Kadhim, Kyra L., Macke, Jakob H., Hennig, Philipp, Berens, Philipp

arXiv.org Artificial IntelligenceJul-15-2024

Ordinary differential equations (ODEs) are widely used to describe dynamical systems in science, but identifying parameters that explain experimental measurements is challenging. In particular, although ODEs are differentiable and would allow for gradient-based parameter optimization, the nonlinear dynamics of ODEs often lead to many local minima and extreme sensitivity to initial conditions. We therefore propose diffusion tempering, a novel regularization technique for probabilistic numerical methods which improves convergence of gradient-based parameter optimization in ODEs. By iteratively reducing a noise parameter of the probabilistic integrator, the proposed method converges more reliably to the true parameters. We demonstrate that our method is effective for dynamical systems of different complexity and show that it obtains reliable parameter estimates for a Hodgkin-Huxley model with a practically relevant number of parameters.

diffusion, optimization, parameter estimation, (14 more...)

arXiv.org Artificial Intelligence

2402.12231

Country:

Europe > Austria > Vienna (0.14)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Add feedback

Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression

Yang, Yingzhen

arXiv.org Machine LearningJul-15-2024

We consider nonparametric regression by an over-parameterized two-layer neural network trained by gradient descent (GD) or its variant in this paper. We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has spectral bias widely studied in the deep learning literature, the trained network renders a particularly sharp generalization bound with a minimax optimal rate of $\cO({1}/{n^{4\alpha/(4\alpha+1)}})$, which is sharper the current standard rate of $\cO({1}/{n^{2\alpha/(2\alpha+1)}})$ with $2\alpha = d/(d-1)$ when the data is distributed uniformly on the unit sphere in $\RR^d$ and $n$ is the size of the training data. When the target function has no spectral bias, we prove that neural network trained with regular GD with early stopping still enjoys minimax optimal rate, and in this case our results do not require distributional assumptions in contrast with the current known results. Our results are built upon two significant technical contributions. First, uniform convergence to the NTK is established during the training process by PGD or GD, so that we can have a nice decomposition of the neural network function at any step of GD or PGD into a function in the RKHS and an error function with a small $L^{\infty}$-norm. Second, local Rademacher complexity is employed to tightly bound the Rademacher complexity of the function class comprising all the possible neural network functions obtained by GD or PGD. Our results also indicate that PGD can be another way of avoiding the usual linear regime of NTK and obtaining sharper generalization bound, because PGD induces a different kernel with lower kernel complexity during the training than the regular NTK induced by the network architecture trained by regular GD.

int, neural network, probability, (14 more...)

arXiv.org Machine Learning

2407.11353

Country:

North America > United States > Arizona > Maricopa County > Tempe (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.74)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)

Add feedback

Towards stable training of parallel continual learning

Yuepan, Li, Lyu, Fan, Li, Yuyang, Feng, Wei, Liu, Guangcan, Shang, Fanhua

arXiv.org Artificial IntelligenceJul-11-2024

Parallel Continual Learning (PCL) tasks investigate the training methods for continual learning with multi-source input, where data from different tasks are learned as they arrive. PCL offers high training efficiency and is well-suited for complex multi-source data systems, such as autonomous vehicles equipped with multiple sensors. However, at any time, multiple tasks need to be trained simultaneously, leading to severe training instability in PCL. This instability manifests during both forward and backward propagation, where features are entangled and gradients are conflict. This paper introduces Stable Parallel Continual Learning (SPCL), a novel approach that enhances the training stability of PCL for both forward and backward propagation. For the forward propagation, we apply Doubly-block Toeplit (DBT) Matrix based orthogonality constraints to network parameters to ensure stable and consistent propagation. For the backward propagation, we employ orthogonal decomposition for gradient management stabilizes backpropagation and mitigates gradient conflicts across tasks. By optimizing gradients by ensuring orthogonality and minimizing the condition number, SPCL effectively stabilizing the gradient descent in complex optimization tasks. Experimental results demonstrate that SPCL outperforms state-of-the-art methjods and achieve better training stability.

artificial intelligence, deep learning, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2407.08214

Country:

North America > United States (0.14)
Europe > France > Grand Est > Bas-Rhin > Strasbourg (0.04)
Asia > China > Tianjin Province > Tianjin (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Add feedback

Faster Machine Unlearning via Natural Gradient Descent

Lev, Omri, Wilson, Ashia

arXiv.org Artificial IntelligenceJul-11-2024

We address the challenge of efficiently and reliably deleting data from machine learning models trained using Empirical Risk Minimization (ERM), a process known as machine unlearning. To avoid retraining models from scratch, we propose a novel algorithm leveraging Natural Gradient Descent (NGD). Our theoretical framework ensures strong privacy guarantees for convex models, while a practical Min/Max optimization algorithm is developed for non-convex models. Comprehensive evaluations show significant improvements in privacy, computational efficiency, and generalization compared to state-of-the-art methods, advancing both the theoretical and practical aspects of machine unlearning.

algorithm, loss function, neural information processing system, (14 more...)

arXiv.org Artificial Intelligence

2407.08169

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California (0.04)
North America > Canada > Ontario > Toronto (0.04)
(3 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Generalization Error Matters in Decentralized Learning Under Byzantine Attacks

Ye, Haoxiang, Ling, Qing

arXiv.org Artificial IntelligenceJul-11-2024

Recently, decentralized learning has emerged as a popular peer-to-peer signal and information processing paradigm that enables model training across geographically distributed agents in a scalable manner, without the presence of any central server. When some of the agents are malicious (also termed as Byzantine), resilient decentralized learning algorithms are able to limit the impact of these Byzantine agents without knowing their number and identities, and have guaranteed optimization errors. However, analysis of the generalization errors, which are critical to implementations of the trained models, is still lacking. In this paper, we provide the first analysis of the generalization errors for a class of popular Byzantine-resilient decentralized stochastic gradient descent (DSGD) algorithms. Our theoretical results reveal that the generalization errors cannot be entirely eliminated because of the presence of the Byzantine agents, even if the number of training samples are infinitely large. Numerical experiments are conducted to confirm our theoretical results.

agent, generalization error, training sample, (10 more...)

arXiv.org Artificial Intelligence

2407.08632

Country: Asia > China > Guangdong Province > Guangzhou (0.04)

Genre: Research Report (0.40)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.46)

Add feedback