AITopics | Gradient Descent

Gradient Descent

Value-Function-based Sequential Minimization for Bi-level Optimization

Liu, Risheng, Liu, Xuan, Zeng, Shangzhi, Zhang, Jin, Zhang, Yixuan

arXiv.org Artificial Intelligence May-6-2023

Gradient-based Bi-Level Optimization (BLO) methods have been widely applied to handle modern learning tasks. However, most existing strategies are theoretically designed based on restrictive assumptions (e.g., convexity of the lower-level sub-problem), and are not computationally applicable to high-dimensional tasks. Moreover, there are almost no gradient-based methods able to solve BLO in challenging scenarios such as BLO with functional constraints and pessimistic BLO. In this work, by reformulating BLO into approximated single-level problems, we provide a new algorithm, named Bi-level Value-Function-based Sequential Minimization (BVFSM), to address the above issues. Specifically, BVFSM constructs a series of value-function-based approximations, and thus avoids the repeated calculations of recurrent gradients and Hessian inverses required by existing approaches, which are especially time-consuming for high-dimensional tasks. We also extend BVFSM to address BLO with additional functional constraints. More importantly, BVFSM can be used for the challenging pessimistic BLO, which has never been properly solved before. In theory, we prove the asymptotic convergence of BVFSM on these types of BLO, in which the restrictive lower-level convexity assumption is discarded. To the best of our knowledge, this is the first gradient-based algorithm that can solve different kinds of BLO (e.g., optimistic, pessimistic, and with constraints) with solid convergence guarantees. Extensive experiments verify the theoretical investigations and demonstrate the superiority of BVFSM on various real-world applications.
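
For orientation, the textbook value-function reformulation of an optimistic bi-level problem can be written as below. The symbols F (upper-level objective), f (lower-level objective), x, y and f^* are notation assumed here for illustration; the abstract does not give the formulas, and the approximations BVFSM actually builds may differ from this plain form.

```latex
% Optimistic bi-level problem (assumed notation)
\min_{x,\,y} \; F(x, y)
\quad \text{s.t.} \quad y \in \operatorname*{arg\,min}_{y'} f(x, y')

% Equivalent single-level form using the lower-level value function
f^{*}(x) := \min_{y'} f(x, y'),
\qquad
\min_{x,\,y} \; F(x, y)
\quad \text{s.t.} \quad f(x, y) \le f^{*}(x)
```

The inequality constraint can only hold with equality, since f(x, y) is never below its own minimum over y, which is why the two problems share the same feasible set.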

artificial intelligence, bvfsm, machine learning, (17 more...)

2110.04974

Country:

North America > Canada > British Columbia > Vancouver Island > Capital Regional District > Victoria (0.14)
Asia > China > Liaoning Province > Dalian (0.04)
Asia > China > Hong Kong (0.04)
(6 more...)

Genre: Research Report (0.81)

Industry: Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

A Bootstrap Algorithm for Fast Supervised Learning

Kouritzin, Michael A, Styles, Stephen, Vritsiou, Beatrice-Helen

arXiv.org Artificial Intelligence May-4-2023

Training a neural network (NN) typically relies on some type of curve-following method, such as gradient descent (GD), stochastic gradient descent (SGD), ADADELTA, ADAM or limited memory algorithms. Convergence for these algorithms usually relies on having access to a large quantity of observations in order to achieve a high level of accuracy and, with certain classes of functions, these algorithms could take multiple epochs of data points to catch on. Herein, a different technique with the potential of achieving dramatically better speeds of convergence, especially for shallow networks, is explored: it does not curve-follow but rather relies on 'decoupling' hidden layers and on updating their weighted connections through bootstrapping, resampling and linear regression. By utilizing resampled observations, the convergence of this process is empirically shown to be remarkably fast and to require fewer data points: in particular, our experiments show that one needs only a fraction of the observations required by traditional neural network training methods to approximate various classes of functions.

algorithm, artificial intelligence, machine learning, (16 more...)

2305.03099

Country:

North America > Canada > Alberta (0.14)
Europe > Spain (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)

A Stochastic Proximal Polyak Step Size

Schaipp, Fabian, Gower, Robert M., Ulbrich, Michael

arXiv.org Artificial Intelligence May-4-2023

Recently, the stochastic Polyak step size (SPS) has emerged as a competitive adaptive step size scheme for stochastic gradient descent. Here we develop ProxSPS, a proximal variant of SPS that can handle regularization terms. Developing a proximal variant of SPS is particularly important, since SPS requires a lower bound of the objective function to work well. When the objective function is the sum of a loss and a regularizer, available estimates of a lower bound of the sum can be loose. In contrast, ProxSPS only requires a lower bound for the loss, which is often readily available. As a consequence, we show that ProxSPS is easier to tune and more stable in the presence of regularization. Furthermore, for image classification tasks, ProxSPS performs as well as AdamW with little to no tuning, and results in a network with smaller weight parameters. We also provide an extensive convergence analysis for ProxSPS that includes the non-smooth, smooth, weakly convex and strongly convex settings.
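
As a rough illustration of the step size rule that ProxSPS builds on, here is a minimal sketch of plain SPS with a cap on the step. It is not ProxSPS and has no proximal step. The function names, the least-squares sample loss, the lower bound `f_star=0.0` and the cap `gamma_max` are all assumptions made for the example.

```python
import random


def sps_step(w, x, y, f_star=0.0, gamma_max=1.0):
    """One SGD step on the sample loss 0.5 * (w.x - y)^2 with a capped Polyak step size.

    f_star is an assumed lower bound on the sample loss (0 for a squared loss).
    """
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    loss = 0.5 * residual ** 2
    grad = [residual * xi for xi in x]
    grad_sq = sum(g * g for g in grad)
    if grad_sq == 0.0:
        return list(w)  # zero gradient: nothing to update
    # Polyak step: (loss - lower bound) / ||grad||^2, capped at gamma_max.
    step = min((loss - f_star) / grad_sq, gamma_max)
    return [wi - step * gi for wi, gi in zip(w, grad)]


def demo(num_steps=500, seed=0):
    """Fit noiseless data generated by the hypothetical weights [2, -1]."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(num_steps):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        y = 2.0 * x[0] - 1.0 * x[1]
        w = sps_step(w, x, y)
    return w
```

No learning rate is tuned here: the step is set from the current loss value and gradient norm, which is why a usable lower bound on the loss matters.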

artificial intelligence, machine learning research, proxsps, (14 more...)

2301.04935

Country:

North America > United States > New York (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

A Momentum-Incorporated Non-Negative Latent Factorization of Tensors Model for Dynamic Network Representation

Zeng, Aoling

arXiv.org Artificial Intelligence May-4-2023

A large-scale dynamic network (LDN) is a source of data in many big data-related applications due to its large number of entities and large-scale dynamic interactions. An LDN can be modeled as a high-dimensional incomplete (HDI) tensor that contains a wealth of knowledge about time patterns. A latent factorization of tensors (LFT) model, which can be built using stochastic gradient descent (SGD) solvers, efficiently extracts these time patterns. However, LFT models based on SGD are often limited by training schemes and have poor tail convergence. To solve this problem, this paper proposes a novel nonlinear LFT model (MNNL) based on momentum-incorporated SGD, which extracts non-negative latent factors from HDI tensors to make training unconstrained and compatible with general training schemes, while improving convergence accuracy and speed. Empirical studies on two LDN datasets show that, compared to existing models, the MNNL model has higher prediction accuracy and convergence speed.
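
The momentum idea mentioned above can be sketched generically as follows. This is the classical heavy-ball update on a toy quadratic, not the MNNL model or its tensor factorization; the function name, the learning rate, the momentum coefficient and the test objective are assumptions chosen for illustration.

```python
def momentum_sgd(grad, w0, lr=0.1, beta=0.9, steps=500):
    """Minimize a function from its gradient using a heavy-ball momentum update.

    grad: callable returning the (possibly stochastic) gradient at w.
    """
    w = list(w0)
    velocity = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        # Accumulate an exponentially weighted sum of past gradients.
        velocity = [beta * vi + gi for vi, gi in zip(velocity, g)]
        # Move against the accumulated direction.
        w = [wi - lr * vi for wi, vi in zip(w, velocity)]
    return w


def toy_grad(w):
    """Gradient of 0.5 * (w0 - 3)^2 + 2.5 * (w1 + 1)^2 (a hypothetical objective)."""
    return [w[0] - 3.0, 5.0 * (w[1] + 1.0)]
```

Calling `momentum_sgd(toy_grad, [0.0, 0.0])` drives the iterate toward the minimizer `[3, -1]` of the toy objective.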

artificial intelligence, ieee transaction, machine learning, (13 more...)

2305.02782

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > New York (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry: Information Technology (0.47)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)

Understanding the Spectral Bias of Coordinate Based MLPs Via Training Dynamics

Lazzari, John, Liu, Xiuwen

arXiv.org Artificial Intelligence May-3-2023

Spectral bias is an important observation of neural network training, stating that the network will learn a low frequency representation of the target function before converging to higher frequency components. This property is interesting due to its link to good generalization in over-parameterized networks. However, in low dimensional settings, a severe spectral bias occurs that obstructs convergence to high frequency components entirely. In order to overcome this limitation, one can encode the inputs using a high frequency sinusoidal encoding. Previous works attempted to explain this phenomenon using Neural Tangent Kernel (NTK) and Fourier analysis. However, NTK does not capture real network dynamics, and Fourier analysis only offers a global perspective on the network properties that induce this bias. In this paper, we provide a novel approach towards understanding spectral bias by directly studying ReLU MLP training dynamics. Specifically, we focus on the connection between the computations of ReLU networks (activation regions), and the speed of gradient descent convergence. We study these dynamics in relation to the spatial information of the signal to understand how they influence spectral bias. We then use this formulation to study the severity of spectral bias in low dimensional settings, and how positional encoding overcomes this.
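
A high frequency sinusoidal encoding of a scalar input might look like the sketch below. The octave-spaced frequencies (powers of two times pi) and the function name are assumptions borrowed from common practice; the abstract does not state which encoding the paper analyzes.

```python
import math


def sinusoidal_encoding(x, num_frequencies=4):
    """Map a scalar coordinate x to [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0..K-1."""
    features = []
    for k in range(num_frequencies):
        freq = (2.0 ** k) * math.pi  # assumed octave-spaced frequency schedule
        features.append(math.sin(freq * x))
        features.append(math.cos(freq * x))
    return features
```

The encoded vector, rather than the raw coordinate, is then fed to the MLP, so nearby inputs are already separated along high frequency directions before any training happens.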

activation region, artificial intelligence, machine learning, (16 more...)

2301.05816

Country: North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs

Zhang, Chengming, Smith, Shaden, Sun, Baixi, Tian, Jiannan, Soifer, Jonathan, Yu, Xiaodong, Song, Shuaiwen Leon, He, Yuxiong, Tao, Dingwen

arXiv.org Artificial Intelligence May-3-2023

Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method that adopts a novel loss function and a proper number of negative samples. However, there is no work that optimizes SimpleX on multi-core CPUs, leading to limited performance. To this end, we perform an in-depth profiling and analysis of existing SimpleX implementations and identify their performance bottlenecks including (1) irregular memory accesses, (2) unnecessary memory copies, and (3) redundant computations. To address these issues, we propose an efficient CF training system (called HEAT) that fully enables the multi-level caching and multi-threading capabilities of modern CPUs. Specifically, the optimization of HEAT is threefold: (1) It tiles the embedding matrix to increase data locality and reduce cache misses (thus reducing read latency); (2) It optimizes stochastic gradient descent (SGD) with sampling by parallelizing vector products instead of matrix-matrix multiplications, in particular the similarity computation therein, to avoid memory copies for matrix data preparation; and (3) It aggressively reuses intermediate results from the forward phase in the backward phase to alleviate redundant computation. Evaluation on five widely used datasets with both x86- and ARM-architecture processors shows that HEAT achieves up to 45.2X speedup over an existing CPU solution, and 4.5X speedup and 7.9X cost reduction in the cloud over an existing GPU solution with an NVIDIA V100 GPU.

artificial intelligence, machine learning, matrix, (19 more...)

doi: 10.1145/3577193.3593717

2304.07334

Country:

North America > United States > Florida > Orange County > Orlando (0.05)
North America > United States > Washington > King County > Redmond (0.04)
North America > United States > Indiana > Monroe County > Bloomington (0.04)
(2 more...)

Genre: Research Report > Promising Solution (0.34)

Industry:

Energy (0.46)
Government > Regional Government (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Random Function Descent

Benning, Felix, Döring, Leif

arXiv.org Artificial Intelligence May-2-2023

While gradient based methods are ubiquitous in machine learning, selecting the right step size often requires "hyperparameter tuning". This is because backtracking procedures like Armijo's rule depend on quality evaluations in every step, which are not available in a stochastic context. Since optimization schemes can be motivated using Taylor approximations, we replace the Taylor approximation with the conditional expectation (the best $L^2$ estimator) and propose "Random Function Descent" (RFD). Under light assumptions common in Bayesian optimization, we prove that RFD is identical to gradient descent, but with calculable step sizes, even in a stochastic context. We beat untuned Adam in synthetic benchmarks. To close the performance gap to tuned Adam, we propose a heuristic extension competitive with tuned Adam.
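
For context, a backtracking line search with Armijo's rule can be sketched as below. It evaluates the exact objective at every trial step, which is the requirement the abstract says is unavailable in a stochastic setting. This is the classical deterministic procedure, not RFD; the function name and the default constants are assumptions.

```python
def armijo_step(f, grad_f, x, t0=1.0, c=1e-4, shrink=0.5, max_halvings=50):
    """Take one gradient step whose size is found by Armijo backtracking.

    Returns the new point and the accepted step size (0.0 if none was accepted).
    """
    g = grad_f(x)
    g_sq = sum(gi * gi for gi in g)
    fx = f(x)
    t = t0
    for _ in range(max_halvings):
        x_new = [xi - t * gi for xi, gi in zip(x, g)]
        # Sufficient decrease condition: f must drop by at least c * t * ||g||^2.
        if f(x_new) <= fx - c * t * g_sq:
            return x_new, t
        t *= shrink
    return list(x), 0.0


def quad(x):
    """Hypothetical test objective x0^2 + 10 * x1^2."""
    return x[0] ** 2 + 10.0 * x[1] ** 2


def quad_grad(x):
    return [2.0 * x[0], 20.0 * x[1]]
```

Each call needs several full evaluations of `f`, so with only noisy minibatch estimates of the objective the acceptance test is no longer reliable.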

artificial intelligence, machine learning, random function, (17 more...)

2305.01377

Country:

North America > Canada > Alberta (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
North America > United States > New York > New York County > New York City (0.04)
(5 more...)

Genre: Research Report (0.50)

Industry: Education > Educational Setting > Online (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)

Performance and Energy Consumption of Parallel Machine Learning Algorithms

Wu, Xidong, Brazzle, Preston, Cahoon, Stephen

arXiv.org Artificial Intelligence May-1-2023

Machine learning models have achieved remarkable success in various real-world applications such as data science, computer vision, and natural language processing. However, model training in machine learning requires large-scale data sets and multiple iterations before it can work properly. Parallelization of training algorithms is a common strategy to speed up the process of training. Power consumption is also an important metric for any type of computation, especially high-performance applications. Machine learning algorithms that can be used on low-power platforms such as sensors and mobile devices have been researched, but less power optimization has been done for algorithms designed for high-performance computing. In this paper, we present a C++ implementation of logistic regression and the genetic algorithm, and a Python implementation of neural networks with the stochastic gradient descent (SGD) algorithm on classification tasks. We show the impact that the complexity of the model and the size of the training data have on the parallel efficiency of the algorithm in terms of both power and performance. We also test these implementations using shared-memory parallelism, distributed-memory parallelism, and GPU acceleration to speed up machine learning model training. Machine learning is a class of data-driven algorithms and models where models progressively improve as they gain experience. It has many applications from image classification to robot control [1]. By providing a set of training data, models can train themselves to accurately process new data outside of the training set.

algorithm, artificial intelligence, machine learning, (17 more...)

2305.00798

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.37)

Industry:

Information Technology (0.46)
Energy (0.41)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.59)

ISAAC Newton: Input-based Approximate Curvature for Newton's Method

Petersen, Felix, Sutter, Tobias, Borgelt, Christian, Huh, Dongsung, Kuehne, Hilde, Sun, Yuekai, Deussen, Oliver

arXiv.org Artificial Intelligence Apr-30-2023

We present ISAAC (Input-baSed ApproximAte Curvature), a novel method that conditions the gradient using selected second-order information and has an asymptotically vanishing computational overhead, assuming a batch size smaller than the number of neurons. We show that it is possible to compute a good conditioner based on only the input to a respective layer without a substantial computational overhead. The proposed method allows effective training even in small-batch stochastic regimes, which makes it competitive with first-order as well as second-order methods. While second-order optimization methods are traditionally much less explored than first-order methods in large-scale machine learning (ML) applications due to their memory requirements and prohibitive computational cost per iteration, they have recently become more popular in ML, mainly due to their fast convergence properties when compared to first-order methods [1]. The expensive computation of an inverse Hessian (also known as a pre-conditioning matrix) in the Newton step has also been tackled via estimating the curvature from the change in gradients.

approximation, machine learning, natural language, (18 more...)

2305.00604

Country:

Europe > Austria > Salzburg > Salzburg (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Michigan (0.04)
(2 more...)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.30)

Further analysis of multilevel Stein variational gradient descent with an application to the Bayesian inference of glacier ice models

Alsup, Terrence, Hartland, Tucker, Peherstorfer, Benjamin, Petra, Noemi

arXiv.org Artificial Intelligence Apr-29-2023

Bayesian inference is a ubiquitous and flexible tool for updating a belief (i.e., learning) about a quantity of interest when data are observed, which ultimately can be used to inform downstream decision-making. In particular, Bayesian inverse problems allow one to derive knowledge from data through the lens of physics-based models. These problems can be formulated as follows: given observational data, a physics-based model, and prior information about the model inputs, find a posterior probability distribution for the inputs that reflects the knowledge about the inputs in terms of the observed data and prior. Typically, the physics-based models are given in the form of an input-to-observation map that is based on a system of partial differential equations (PDEs). The computational task underlying Bayesian inference is approximating posterior probability distributions to compute expectations and to quantify uncertainties. There are multiple ways of computationally exploring posterior distributions to gain insights, ranging from Markov chain Monte Carlo to variational methods [24, 42, 28]. In this work, we make use of Stein variational gradient descent (SVGD) [32], which is a method for particle-based variational inference, to approximate posterior distributions. It builds on Stein's identity to formulate an update step for the particles that can be realized numerically in an efficient manner via
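
The SVGD particle update referred to in the excerpt above can be sketched in one dimension as follows. This is the standard single-level SVGD step with an RBF kernel, not the multilevel variant the paper studies; the function name, the fixed bandwidth, the step size and the Gaussian targets used in the usage note are assumptions.

```python
import math


def svgd_step(particles, score, step=0.1, bandwidth=1.0):
    """Apply one SVGD update to a list of scalar particles.

    score: callable returning d/dx log p(x) for the target density p.
    """
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * bandwidth ** 2))
            # Derivative of k(xj, xi) with respect to xj; pushes xi away from xj.
            grad_k = (xi - xj) / bandwidth ** 2 * k
            # Kernel-weighted score (attraction) plus kernel gradient (repulsion).
            phi += k * score(xj) + grad_k
        updated.append(xi + step * phi / n)
    return updated
```

With a single particle the repulsion term vanishes and the update reduces to gradient ascent on the log density; for example, `svgd_step([0.0], lambda x: -(x - 3.0), step=0.5)` moves the particle from 0 toward the mean of a unit-variance Gaussian centered at 3.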

artificial intelligence, bayesian inference, machine learning, (11 more...)

2212.03366

Country:

North America > United States > California > Merced County > Merced (0.14)
North America > United States > New York (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)
