Interpreting a Penalty as the Influence of a Bayesian Prior

Wolinski, Pierre, Charpiat, Guillaume, Ollivier, Yann

arXiv.org Machine Learning 

Inria, Team TAU, Gif-sur-Yvette, France; Facebook, France. arXiv:2002.00178v1

For instance, penalties are used to improve generalization, prune neurons, or reduce the rank of weight tensors. The usual penalties are mostly empirical and user-defined, and are integrated into the loss as follows: $L(w) = \ell(w) + r(w)$, where $w$ is the vector of all parameters of the network, $\ell(w)$ is the error term, and $r(w)$ is the penalty term. From a Bayesian point of view, optimizing such a loss $L$ is equivalent to finding the Maximum A Posteriori (MAP) estimate of the parameters $w$ given the training data and a prior $\alpha \propto \exp(-r)$. Indeed, assuming that the loss $\ell$ is a log-likelihood loss, namely $\ell(w) = -\ln p_w(D)$ with dataset $D$, minimizing $L$ is equivalent to minimizing $L_{\mathrm{MAP}}(w) = -\ln p_w(D) - \ln \alpha(w)$. Thus, within the MAP framework, we can interpret the penalty term $r$ as the influence of a prior $\alpha$ [14]. However, the MAP approximates the Bayesian posterior very roughly, by a single point at its maximum. Variational Inference (VI) provides a variational posterior distribution rather than a single value, hopefully representing the Bayesian posterior much better. VI looks for the best posterior approximation within a family $\beta_u(w)$ of approximate posteriors over $w$, parameterized by $u$.
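The equivalence between a penalized loss and a MAP objective can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions, not the authors' code: it assumes an L2 penalty r(w) = (lam / 2) ||w||^2, the matching Gaussian prior alpha = N(0, (1 / lam) I), and a unit-variance linear-Gaussian likelihood. All function names and the data are hypothetical. Under these assumptions, -ln alpha(w) equals r(w) plus a constant that does not depend on w, so L and L_MAP share the same minimizer.

```python
import math

def l2_penalty(w, lam):
    """Penalty term r(w) = (lam / 2) * ||w||^2 (an assumed example penalty)."""
    return 0.5 * lam * sum(x * x for x in w)

def neg_log_gaussian_prior(w, lam):
    """-ln alpha(w) for the prior alpha = N(0, (1 / lam) I), normalizer included."""
    log_norm = 0.5 * len(w) * math.log(2.0 * math.pi / lam)
    return 0.5 * lam * sum(x * x for x in w) + log_norm

def neg_log_likelihood(w, data):
    """Error term -ln p_w(D) for a toy model y ~ N(w . x, 1)."""
    total = 0.0
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        total += 0.5 * (y - pred) ** 2 + 0.5 * math.log(2.0 * math.pi)
    return total

def penalized_loss(w, data, lam):
    """L(w) = error term + penalty term."""
    return neg_log_likelihood(w, data) + l2_penalty(w, lam)

def map_loss(w, data, lam):
    """L_MAP(w) = -ln p_w(D) - ln alpha(w)."""
    return neg_log_likelihood(w, data) + neg_log_gaussian_prior(w, lam)

def gap(w, data, lam):
    """L_MAP(w) - L(w); constant in w when alpha is proportional to exp(-r)."""
    return map_loss(w, data, lam) - penalized_loss(w, data, lam)

# Hypothetical toy dataset of (input vector, target) pairs.
DATA = [((1.0, 0.0), 0.5), ((0.0, 1.0), -1.0), ((1.0, 1.0), 0.2)]
```

Evaluating `gap` at two different parameter vectors, for example `gap((0.0, 0.0), DATA, 0.1)` and `gap((3.0, -2.0), DATA, 0.1)`, gives the same value up to floating-point error, namely the log-normalizer of the prior. This is why the normalizing constant of the prior can be ignored when reading a penalty as a prior.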
