hessian-vector product
Reducing Reparameterization Gradient Variance
Andrew Miller, Nick Foti, Alexander D'Amour, Ryan P. Adams
Optimization with noisy gradients has become ubiquitous in statistics and machine learning. Reparameterization gradients, or gradient estimates computed via the "reparameterization trick," represent a class of noisy gradients often used in Monte Carlo variational inference (MCVI). However, when these gradient estimators are too noisy, the optimization procedure can be slow or fail to converge. One way to reduce noise is to generate more samples for the gradient estimate, but this can be computationally expensive. Instead, we view the noisy gradient as a random variable and form an inexpensive approximation of the generating procedure for the gradient sample. By construction, this approximation is highly correlated with the noisy gradient, making it a useful control variate for variance reduction. We demonstrate our approach on a non-conjugate hierarchical model and a Bayesian neural network, where our method achieves orders-of-magnitude (20-2,000x) reductions in gradient variance, resulting in faster and more stable optimization.
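To make the control-variate idea concrete, here is a minimal sketch in JAX for the mean parameter of a diagonal-Gaussian variational family. It assumes a first-order (delta-method) linearization of the gradient map around eps = 0 as the cheap approximation, computed with a Hessian-vector product; `log_joint` and all other names below are illustrative stand-ins, and the paper's full construction also handles the scale parameters and differs in its details.

```python
# Minimal sketch (not the paper's exact construction): variance reduction for
# the reparameterization gradient of the mean of a diagonal-Gaussian
# variational family, using a zero-mean linear control variate built from a
# Hessian-vector product.  `log_joint` is an illustrative stand-in target.
import jax
import jax.numpy as jnp

def log_joint(z):
    # Any differentiable log density works here; this one is just a stand-in.
    return -0.5 * jnp.sum(z ** 2) - 0.1 * jnp.sum(jnp.sin(z))

def rep_grad_mu(mu, sigma, eps):
    """Ordinary reparameterization gradient w.r.t. mu: grad_z log_joint(mu + sigma * eps)."""
    return jax.grad(log_joint)(mu + sigma * eps)

def rep_grad_mu_cv(mu, sigma, eps):
    """Variance-reduced estimate: subtract a zero-mean linear control variate.

    Linearizing the gradient map around eps = 0 gives
        g(eps) ~ grad log_joint(mu) + H(mu) (sigma * eps),
    and the second term has expectation zero under eps ~ N(0, I), so
    subtracting it keeps the estimator unbiased while cancelling most of
    the noise when the target is locally close to quadratic.
    """
    hvp = jax.jvp(jax.grad(log_joint), (mu,), (sigma * eps,))[1]  # H(mu) @ (sigma * eps)
    return rep_grad_mu(mu, sigma, eps) - hvp

key = jax.random.PRNGKey(0)
mu, sigma = jnp.zeros(5), 0.3 * jnp.ones(5)
eps = jax.random.normal(key, (5,))
print(rep_grad_mu(mu, sigma, eps))     # noisy estimate
print(rep_grad_mu_cv(mu, sigma, eps))  # same expectation, lower variance
```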
NEON2: Finding Local Minima via First-Order Oracles
We propose a reduction for non-convex optimization that can (1) turn a stationary-point-finding algorithm into a local-minimum-finding one, and (2) replace Hessian-vector product computations with only gradient computations. It works in both the stochastic and deterministic settings without hurting the algorithm's performance. As applications, our reduction turns Natasha2 into a first-order method without hurting its theoretical performance. It also converts SGD, GD, SCSG, and SVRG into algorithms that find approximate local minima, outperforming some of the best known results.
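The first-order primitive behind this kind of reduction is that a Hessian-vector product can be approximated by a finite difference of two gradient evaluations. The sketch below illustrates that identity in JAX with an illustrative objective `f`; NEON2 itself builds a negative-curvature search on top of this primitive, with a careful choice of the perturbation size and an error analysis not shown here.

```python
# Minimal sketch of the gradient-only primitive:
#     H(x) v  ~  (grad f(x + delta * v) - grad f(x)) / delta.
# `f` is an illustrative smooth non-convex objective, not one from the paper.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.cos(x)) + 0.5 * jnp.sum(x ** 2)

def hvp_from_gradients(x, v, delta=1e-4):
    """Approximate the Hessian-vector product H(x) v with two gradient calls."""
    g = jax.grad(f)
    return (g(x + delta * v) - g(x)) / delta

x = jnp.array([0.3, -1.2, 0.7])
v = jnp.array([1.0, 0.0, -1.0])
print(hvp_from_gradients(x, v))             # gradient-only approximation
print(jax.jvp(jax.grad(f), (x,), (v,))[1])  # exact HVP, for comparison
```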
A Algorithms
Below we include detailed pseudocode for the algorithms described in the main text.

Algorithm 2: Parameter-Free DeltaShift
Input: implicit matrix-vector multiplication access to A

In this section, we give a full proof of Theorem 1.1 with the correct logarithmic dependence. Before doing so, we collect several definitions and results required for proving the theorem. As discussed, a tight analysis of Hutchinson's estimator, and also of our DeltaShift algorithm, relies on these results; finally, from Claim B.2, we immediately have the claimed bound. Although stated for Rademacher random vectors, a similar analysis can be performed for any i.i.d. random vectors. Now we are ready to move on to the main result. The proof is by induction: we claim that the bound holds for all j = 1, ..., m, and then consider the inductive case.
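For reference, the basic primitive the passage above builds on is Hutchinson's trace estimator, which needs only implicit matrix-vector products with A. Below is a minimal sketch in JAX; `matvec`, the example matrix `A`, and the sample count are illustrative assumptions, and this is the plain estimator rather than the parameter-free DeltaShift algorithm itself.

```python
# Minimal sketch of Hutchinson's trace estimator with implicit matvec access:
#     tr(A) ~ (1/m) * sum_i x_i^T A x_i,   x_i i.i.d. Rademacher vectors.
# This is the basic estimator only, not the parameter-free DeltaShift algorithm.
import jax
import jax.numpy as jnp

def hutchinson_trace(matvec, dim, num_samples, key):
    """Average x^T A x over Rademacher vectors x, using only matvec access to A."""
    keys = jax.random.split(key, num_samples)
    def single_estimate(k):
        x = jax.random.rademacher(k, (dim,), dtype=jnp.float32)
        return x @ matvec(x)
    return jnp.mean(jax.vmap(single_estimate)(keys))

A = jnp.diag(jnp.arange(1.0, 11.0))  # illustrative matrix with trace 55
estimate = hutchinson_trace(lambda x: A @ x, dim=10, num_samples=500,
                            key=jax.random.PRNGKey(0))
print(estimate)  # close to 55
```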