A Primer on PAC-Bayesian Learning

Generalized Bayesian learning algorithms are increasingly popular in machine learning, due to their PAC generalization properties and flexibility. The present paper aims at providing a self-contained survey on the resulting PAC-Bayes framework and some of its main theoretical and algorithmic developments.

PAC-Bayes under potentially heavy tails

Subsequent work developed finite-sample risk bounds for "Bayesian" learning algorithms which specify a distribution over the model [14]. These bounds are controlled using the empirical risk and the relative entropy between "prior" and "posterior" distributions. They hold uniformly over the choice of the posterior, so the guarantees remain valid for data-dependent posteriors, hence the naming. Furthermore, choosing the posterior to minimize PAC-Bayesian risk bounds leads to practical learning algorithms which have seen numerous successful applications [3]. Building on this, a tremendous amount of work has refined, extended, and applied the PAC-Bayesian framework to new learning problems. Tight risk bounds for bounded losses are due to Seeger [16] and Maurer [12], with the former work applying them to Gaussian processes.
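To make the roles of the empirical risk and the relative entropy concrete, here is a minimal Python sketch of one commonly stated form of the Seeger/Maurer bound for losses in [0,1]: with probability at least 1 - delta, kl(empirical risk || true risk) <= (KL(Q||P) + ln(2 sqrt(n)/delta)) / n. The function names and the example numbers are illustrative assumptions, not taken from the cited papers, and the form is usually stated for n >= 8.

```python
import math

def binary_kl(q, p):
    """Relative entropy kl(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total

def kl_inverse_upper(q, budget, tol=1e-10):
    """Largest p in [q, 1] with kl(q || p) <= budget, found by bisection."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_kl(q, mid) > budget:
            hi = mid
        else:
            lo = mid
    return hi

def pac_bayes_kl_bound(emp_risk, kl_qp, n, delta):
    """Upper bound on the true risk of the posterior Q (assumed form, loss in [0,1])."""
    budget = (kl_qp + math.log(2.0 * math.sqrt(n) / delta)) / n
    return kl_inverse_upper(emp_risk, budget)

# Hypothetical numbers: empirical risk 0.1, KL(Q||P) = 5 nats, n = 1000, delta = 0.05.
example_bound = pac_bayes_kl_bound(0.1, 5.0, 1000, 0.05)
```

Because the bound holds simultaneously for all posteriors, the same function can be evaluated at a posterior chosen after seeing the data; a larger KL(Q||P) simply loosens the resulting guarantee.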

Novel Change of Measure Inequalities and PAC-Bayesian Bounds

PAC-Bayesian theory has received growing attention in the machine learning community. Our work extends the PAC-Bayesian theory by introducing several novel change of measure inequalities for two families of divergences: $f$-divergences and $\alpha$-divergences. First, we show how the variational representation for $f$-divergences leads to novel change of measure inequalities. Second, we propose a multiplicative change of measure inequality for $\alpha$-divergences, which leads to tighter bounds under some technical conditions. Finally, we present several PAC-Bayesian bounds for various classes of random variables, by using our novel change of measure inequalities.
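For orientation, the classical KL-based change of measure inequality (the Donsker-Varadhan form, which is the baseline such work generalizes, not one of the novel inequalities of the paper) states that $E_Q[\phi] \le KL(Q\|P) + \ln E_P[e^{\phi}]$ for any suitable function $\phi$. The sketch below checks it numerically on a three-point space; the distributions and the function `phi` are arbitrary illustrative choices.

```python
import math

def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def change_of_measure_gap(q, p, phi):
    """KL(Q||P) + ln E_P[exp(phi)] - E_Q[phi]; the inequality says this is >= 0."""
    expected_phi = sum(qi * fi for qi, fi in zip(q, phi))
    log_mgf = math.log(sum(pi * math.exp(fi) for pi, fi in zip(p, phi)))
    return kl_divergence(q, p) + log_mgf - expected_phi

def gibbs(p, phi):
    """Distribution proportional to P * exp(phi), for which the gap is zero."""
    weights = [pi * math.exp(fi) for pi, fi in zip(p, phi)]
    z = sum(weights)
    return [w / z for w in weights]

# Arbitrary illustrative inputs.
p = [0.5, 0.3, 0.2]
phi = [1.0, -0.5, 2.0]
q = [0.2, 0.2, 0.6]

gap_arbitrary = change_of_measure_gap(q, p, phi)          # strictly positive here
gap_gibbs = change_of_measure_gap(gibbs(p, phi), p, phi)  # zero up to rounding
```

The gap vanishes exactly at the Gibbs distribution, which is why KL-based PAC-Bayesian bounds are optimized by Gibbs posteriors.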

PAC-Bayesian Inequalities for Martingales

We present a set of high-probability inequalities that control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. Our results extend the PAC-Bayesian analysis in learning theory from the i.i.d. setting to martingales, opening the way for its application to importance weighted sampling, reinforcement learning, and other interactive learning domains, as well as many other domains in probability theory and statistics where martingales are encountered. We also present a comparison inequality that bounds the expectation of a convex function of a martingale difference sequence shifted to the [0,1] interval by the expectation of the same function of independent Bernoulli variables. This inequality is applied to derive a tighter analog of the Hoeffding-Azuma inequality.
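As a point of reference, the classical Hoeffding-Azuma inequality (not the tighter analog derived in the paper) says that a martingale whose i-th increment is bounded in absolute value by c_i satisfies P(|X_n - X_0| >= t) <= 2 exp(-t^2 / (2 sum_i c_i^2)). The sketch below compares this bound with the exact tail of the simplest such martingale, a symmetric ±1 random walk; the function names are illustrative.

```python
import math

def azuma_hoeffding_bound(t, increment_bounds):
    """Classical two-sided bound on P(|X_n - X_0| >= t), capped at 1."""
    sum_sq = sum(c * c for c in increment_bounds)
    return min(1.0, 2.0 * math.exp(-t * t / (2.0 * sum_sq)))

def exact_walk_tail(n, t):
    """Exact P(|S_n| >= t) where S_n is a sum of n independent uniform ±1 steps."""
    favourable = 0
    for k in range(n + 1):           # k = number of +1 steps, so S_n = 2k - n
        if abs(2 * k - n) >= t:
            favourable += math.comb(n, k)
    return favourable / 2 ** n

n_steps = 100
comparison = [
    (t, exact_walk_tail(n_steps, t), azuma_hoeffding_bound(t, [1.0] * n_steps))
    for t in (5, 10, 20, 30)
]
```

Each tuple in `comparison` holds a threshold, the exact tail probability, and the bound; the slack between the last two is the kind of room a tighter inequality can reclaim.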

PAC-Bayesian Theory Meets Bayesian Inference

We exhibit a strong link between frequentist PAC-Bayesian bounds and the Bayesian marginal likelihood. That is, for the negative log-likelihood loss function, we show that the minimization of PAC-Bayesian generalization bounds maximizes the Bayesian marginal likelihood. This provides an alternative explanation of the Bayesian Occam's razor criterion, under the assumption that the data are sampled i.i.d. Moreover, as the negative log-likelihood is an unbounded loss function, we motivate and propose a PAC-Bayesian theorem tailored for the sub-gamma loss family, and we show that our approach is sound on classical Bayesian linear regression tasks.
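The core identity behind this link can be checked on a toy example. For the negative log-likelihood loss, the objective E_Q[total NLL] + KL(Q||P) is minimized by the Bayesian posterior, and its minimum equals minus the log marginal likelihood. The sketch below verifies this on a Bernoulli model with a finite grid of parameters; the grid, the uniform prior, and the data are illustrative assumptions, and the code does not reproduce the paper's theorem or its sub-gamma analysis.

```python
import math

def neg_log_lik(theta, data):
    """Total negative log-likelihood of binary data under Bernoulli(theta)."""
    return -sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions over the parameter grid."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def objective(q, prior, thetas, data):
    """Expected total NLL under Q plus KL(Q || prior)."""
    expected_nll = sum(qi * neg_log_lik(th, data) for qi, th in zip(q, thetas))
    return expected_nll + kl_divergence(q, prior)

def bayes_posterior(prior, thetas, data):
    """Return the Bayesian posterior on the grid and the marginal likelihood."""
    weights = [pi * math.exp(-neg_log_lik(th, data)) for pi, th in zip(prior, thetas)]
    evidence = sum(weights)
    return [w / evidence for w in weights], evidence

# Hypothetical setup: five candidate parameters, uniform prior, eight coin flips.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.2] * 5
data = [1, 1, 0, 1, 0, 1, 1, 1]

posterior, evidence = bayes_posterior(prior, thetas, data)
value_at_posterior = objective(posterior, prior, thetas, data)
value_at_prior = objective(prior, prior, thetas, data)
```

Here `value_at_posterior` matches `-math.log(evidence)` up to rounding, while any other choice of Q, such as the prior itself, gives a larger objective.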