Germain, Pascal, Habrard, Amaury, Laviolette, François, Morvant, Emilie

In this paper, we provide two main contributions in PAC-Bayesian theory for domain adaptation where the objective is to learn, from a source distribution, a well-performing majority vote on a different target distribution. On the one hand, we propose an improvement of the previous approach proposed by Germain et al. (2013), that relies on a novel distribution pseudodistance based on a disagreement averaging, allowing us to derive a new tighter PAC-Bayesian domain adaptation bound for the stochastic Gibbs classifier. We specialize it to linear classifiers, and design a learning algorithm which shows interesting results on a synthetic problem and on a popular sentiment annotation task. On the other hand, we generalize these results to multisource domain adaptation allowing us to take into account different source domains. This study opens the door to tackle domain adaptation tasks by making use of all the PAC-Bayesian tools.

Herbrich, Ralf, Graepel, Thore

We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to nontrivial bound values and - for maximum margins - to a vanishing complexity term.Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1 Introduction Linear classifiers are exceedingly popular in the machine learning community due to their straightforward applicability and high flexibility which has recently been boosted by the so-called kernel methods [13]. A natural and popular framework for the theoretical analysis of classifiers is the PAC (probably approximately correct) framework[11] which is closely related to Vapnik's work on the generalisation error [12]. For binary classifiers it turned out that the growth function is an appropriate measureof "complexity" and can tightly be upper bounded by the VC (Vapnik-Chervonenkis) dimension [14].

Germain, Pascal, Habrard, Amaury, Laviolette, François, Morvant, Emilie

In machine learning, Domain Adaptation (DA) arises when the distribution gen- erating the test (target) data differs from the one generating the learning (source) data. It is well known that DA is an hard task even under strong assumptions, among which the covariate-shift where the source and target distributions diverge only in their marginals, i.e. they have the same labeling function. Another popular approach is to consider an hypothesis class that moves closer the two distributions while implying a low-error for both tasks. This is a VC-dim approach that restricts the complexity of an hypothesis class in order to get good generalization. Instead, we propose a PAC-Bayesian approach that seeks for suitable weights to be given to each hypothesis in order to build a majority vote. We prove a new DA bound in the PAC-Bayesian context. This leads us to design the first DA-PAC-Bayesian algorithm based on the minimization of the proposed bound. Doing so, we seek for a \rho-weighted majority vote that takes into account a trade-off between three quantities. The first two quantities being, as usual in the PAC-Bayesian approach, (a) the complexity of the majority vote (measured by a Kullback-Leibler divergence) and (b) its empirical risk (measured by the \rho-average errors on the source sample). The third quantity is (c) the capacity of the majority vote to distinguish some structural difference between the source and target samples.

Germain, Pascal, Habrard, Amaury, Laviolette, François, Morvant, Emilie

We study the issue of PAC-Bayesian domain adaptation: We want to learn, from a source domain, a majority vote model dedicated to a target one. Our theoretical contribution brings a new perspective by deriving an upper-bound on the target risk where the distributions' divergence---expressed as a ratio---controls the trade-off between a source error measure and the target voters' disagreement. Our bound suggests that one has to focus on regions where the source data is informative.From this result, we derive a PAC-Bayesian generalization bound, and specialize it to linear classifiers. Then, we infer a learning algorithmand perform experiments on real data.

We present a new generalization error bound, the \emph{PAC-Bayesian transportation bound}, unifying the PAC-Bayesian analysis and the generic chaining method in view of the optimal transportation. The proposed bound is the first PAC-Bayesian framework that characterizes the cost of de-randomization of stochastic predictors facing any Lipschitz loss functions. As an example, we give an upper bound on the de-randomization cost of spectrally normalized neural networks~(NNs) to evaluate how much randomness contributes to the generalization of NNs.