Natural Gradients and Stochastic Variational Inference

#artificialintelligence 

Examining the Fisher for diagonal Gaussians allows us to see exactly how the natural gradient differs from the standard gradient. If a dimension has small variance, then preconditioning by the inverse Fisher makes the natural gradient smaller along the mean dimension, . Intuitively, this makes sense -- we want our optimization routine to slow down when the component variance is small because small changes to the mean correspond to big changes in KL (see the two skinny Gaussians above), which can result in chatoic looking optimization traces. When the variance along a dimension is large, the inverse Fisher elongates the standard gradient along that dimension. Again, this makes sense -- when the component variance is high we can move the mean a lot farther (in Euclidean distance) without moving that far in terms of KL (see the two wide Gaussians above).