AITopics

Country:

North America > United States > California (0.14)
North America > United States > Connecticut > New Haven County > New Haven (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.95)

Gradient Descent: Second Order Momentum and Saturating Error

Pearlmutter, Barak

We then regard gradient descent with momentum as a dynamic system and explore a nonquadratic error surface, showing that saturation of the error accounts for a variety of effects observed in simulations and justifies some popular heuristics. 1 INTRODUCTION Gradient descent is the bread-and-butter optimization technique in neural networks. Some people build special purpose hardware to accelerate gradient descent optimization of backpropagation networks. Understanding the dynamics of gradient descent on such surfaces is therefore of great practical value. Here we briefly review the known results on the convergence of batch gradient descent; show that second-order momentum does not give any speedup; simulate a real network and observe some effects not predicted by theory; and account for these effects by analyzing gradient descent with momentum on a saturating error surface.
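The idea of treating gradient descent with momentum as a dynamic system can be illustrated with a small sketch. The update rule below is the standard textbook momentum form with an optional second-order term; it is an assumption here, since the listing does not reproduce the paper's equations. The function names, the toy quadratic error, and all parameter values are hypothetical and chosen only for illustration.

```python
def descend_with_momentum(grad, w0, eta, alpha=0.0, beta=0.0, steps=50):
    """Iterate w(t+1) = w(t) + dw(t), where
    dw(t) = -eta * grad(w(t)) + alpha * dw(t-1) + beta * dw(t-2).

    alpha is the first-order momentum coefficient; beta is an assumed
    second-order momentum coefficient (0 disables it).
    """
    w = w0
    dw_prev, dw_prev2 = 0.0, 0.0  # the two most recent weight changes
    trajectory = [w]
    for _ in range(steps):
        dw = -eta * grad(w) + alpha * dw_prev + beta * dw_prev2
        w = w + dw
        dw_prev2, dw_prev = dw_prev, dw
        trajectory.append(w)
    return trajectory


def quadratic_grad(w, curvature=1.0):
    """Gradient of the toy error E(w) = 0.5 * curvature * w**2."""
    return curvature * w


# Compare plain descent with first-order momentum on the same toy surface.
plain = descend_with_momentum(quadratic_grad, w0=1.0, eta=0.1)
with_momentum = descend_with_momentum(quadratic_grad, w0=1.0, eta=0.1, alpha=0.5)
print(abs(plain[-1]), abs(with_momentum[-1]))
```

On this toy quadratic the run with alpha=0.5 ends closer to the minimum than the plain run. A saturating error surface is not modeled here; swapping in a different `grad` function would be one way to explore it.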

convergence, gradient descent, momentum, (11 more...)

Country: North America > United States > Connecticut > New Haven County > New Haven (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Darken, Christian, Moody, John

Towards Faster Stochastic Gradient Search

Stochastic gradient descent is a general algorithm which includes LMS, online backpropagation, and adaptive k-means clustering as special cases.
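To make the "special case" claim concrete, here is a minimal sketch of LMS written as stochastic gradient descent on a squared error, with one weight update per sample. The function name, the synthetic noiseless data, and the constant learning rate are assumptions for illustration and are not taken from the paper.

```python
import random


def lms_sgd(samples, eta, n_weights):
    """LMS as SGD: after each sample, step along the negative gradient
    of the per-sample squared error 0.5 * (y - w.x)**2."""
    w = [0.0] * n_weights
    for x, y in samples:
        err = y - sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical noiseless linear target used only to exercise the update.
rng = random.Random(0)
true_w = [2.0, -3.0]
samples = []
for _ in range(2000):
    x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    y = sum(t * xi for t, xi in zip(true_w, x))
    samples.append((x, y))

learned = lms_sgd(samples, eta=0.1, n_weights=2)
print(learned)
```

Online backpropagation and adaptive k-means follow the same pattern of one noisy gradient step per presented sample, with a different model and error function in place of the linear one used here.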

converge, convergence, gradient descent, (12 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Darken, Christian, Moody, John E.

Note on Learning Rate Schedules for Stochastic Optimization

We present and compare learning rate schedules for stochastic gradient descent, a general algorithm which includes LMS, online backpropagation and k-means clustering as special cases. We introduce "search-then-converge" type schedules which outperform the classical constant and "running average" (1/t) schedules both in speed of convergence and quality of solution.
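The three schedule families named above can be sketched as simple functions of the step count t. The abstract does not give the formulas, so the forms below are assumptions: a constant rate, a 1/t rate, and eta0 / (1 + t / tau) as one simple form often associated with search-then-converge. The names `eta0` and `tau` and the example values are hypothetical.

```python
def constant_schedule(t, eta0):
    """Learning rate that never changes."""
    return eta0


def running_average_schedule(t, eta0):
    """Classical 1/t decay; defined for t >= 1."""
    return eta0 / t


def search_then_converge_schedule(t, eta0, tau):
    """Assumed simple form: close to eta0 while t << tau (search phase),
    close to eta0 * tau / t once t >> tau (converge phase)."""
    return eta0 / (1.0 + t / tau)


for t in (1, 10, 100, 1000, 10000):
    print(
        t,
        constant_schedule(t, 0.5),
        running_average_schedule(t, 0.5),
        search_then_converge_schedule(t, 0.5, tau=100.0),
    )
```

The design point this sketch illustrates is that the search-then-converge rate stays large for roughly the first `tau` steps and only then begins to decay like 1/t, whereas the plain 1/t schedule starts shrinking immediately.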

algorithm, exemplar, learning rate schedule, (10 more...)

Country:

North America > United States > California (0.14)
North America > United States > Connecticut > New Haven County > New Haven (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.38)

Learning by Combining Memorization and Gradient Descent

Platt, John C.

We have created a radial basis function network that allocates a new computational unit whenever an unusual pattern is presented to the network. The network learns by allocating new units and adjusting the parameters of existing units. If the network performs poorly on a presented pattern, then a new unit is allocated which memorizes the response to the presented pattern. If the network performs well on a presented pattern, then the network parameters are updated using standard LMS gradient descent. For predicting the Mackey-Glass chaotic time series, our network learns much faster than do those using back-propagation and uses a comparable number of synapses.
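The allocate-or-adjust rule described above can be sketched in a few lines. This is a heavily simplified illustration under stated assumptions, not the author's network: inputs are one-dimensional, units are Gaussians of fixed width, allocation is triggered by an error threshold alone, and the LMS step adjusts only the unit heights. The class name and all parameter values are hypothetical.

```python
import math


class TinyAllocatingRBF:
    """Toy RBF network that memorizes badly predicted patterns and
    otherwise makes a small LMS gradient step on the unit heights."""

    def __init__(self, err_threshold=0.1, width=0.5, lr=0.05):
        self.centers = []   # one center per allocated unit
        self.heights = []   # output weight of each unit
        self.err_threshold = err_threshold
        self.width = width
        self.lr = lr

    def _activations(self, x):
        return [math.exp(-((x - c) / self.width) ** 2) for c in self.centers]

    def predict(self, x):
        return sum(h * a for h, a in zip(self.heights, self._activations(x)))

    def observe(self, x, y):
        err = y - self.predict(x)
        if abs(err) > self.err_threshold:
            # Poor performance: allocate a unit centered on x whose height
            # cancels the current error, so the pattern is memorized.
            self.centers.append(x)
            self.heights.append(err)
        else:
            # Good performance: LMS update of the existing heights.
            for i, a in enumerate(self._activations(x)):
                self.heights[i] += self.lr * err * a


net = TinyAllocatingRBF()
net.observe(0.0, 1.0)    # large error, so a unit is allocated
net.observe(0.0, 1.05)   # small error, so only the height is nudged
print(len(net.centers), net.predict(0.0))
```

Because a newly allocated unit has activation 1 at its own center, setting its height to the current error makes the network reproduce that pattern exactly right after allocation, which is the "memorization" half of the scheme.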

compact representation, gradient descent, representation, (14 more...)

Country:

North America > United States > New Mexico > Los Alamos County > Los Alamos (0.05)
North America > United States > New York (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.77)
