Germain, Pascal, Lacasse, Alexandre, Marchand, Mario, Shanian, Sara, Laviolette, François

We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. Papers published at the Neural Information Processing Systems Conference.

Marchand, Mario, Su, Hongyu, Morvant, Emilie, Rousu, Juho, Shawe-Taylor, John S.

We show that the usual score function for conditional Markov networks can be written as the expectation over the scores of their spanning trees. We also show that a small random sample of these output trees can attain a significant fraction of the margin obtained by the complete graph and we provide conditions under which we can perform tractable inference. The experimental results confirm that practical learning is scalable to realistic datasets using this approach. Papers published at the Neural Information Processing Systems Conference.

Tossou, Prudencio, Dura, Basile, Laviolette, Francois, Marchand, Mario, Lacoste, Alexandre

Deep kernel learning provides an elegant and principled framework for combining the structural properties of deep learning algorithms with the flexibility of kernel methods. By means of a deep neural network, it consists of learning a kernel operator which is combined with a differentiable kernel algorithm for inference. While previous work within this framework has mostly explored learning a single kernel for large datasets, we focus herein on learning a kernel family for a variety of tasks in few-shot regression settings. Compared to single deep kernel learning, our novel algorithm permits finding the appropriate kernel for each task during inference, rather than using the same for all tasks. As such, our algorithm performs more effectively with complex task distributions in few-shot learning, which we demonstrate by benchmarking against existing state-of-the-art algorithms using real-world, few-shot regression tasks related to drug discovery.

Drouin, Alexandre, Raymond, Frédéric, St-Pierre, Gaël Letarte, Marchand, Mario, Corbeil, Jacques, Laviolette, François

Antimicrobial resistance is an important public health concern that has implications in the practice of medicine worldwide. Accurately predicting resistance phenotypes from genome sequences shows great promise in promoting better use of antimicrobial agents, by determining which antibiotics are likely to be effective in specific clinical cases. In healthcare, this would allow for the design of treatment plans tailored for specific individuals, likely resulting in better clinical outcomes for patients with bacterial infections. In this work, we present the recent work of Drouin et al. (2016) on using Set Covering Machines to learn highly interpretable models of antibiotic resistance and complement it by providing a large scale application of their method to the entire PATRIC database. We report prediction results for 36 new datasets and present the Kover AMR platform, a new web-based tool allowing the visualization and interpretation of the generated models.

Ganin, Yaroslav, Ustinova, Evgeniya, Ajakan, Hana, Germain, Pascal, Larochelle, Hugo, Laviolette, François, Marchand, Mario, Lempitsky, Victor

We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.

Germain, Pascal, Lacasse, Alexandre, Laviolette, François, Marchand, Mario, Roy, Jean-Francis

We propose an extensive analysis of the behavior of majority votes in binary classification. In particular, we introduce a risk bound for majority votes, called the C-bound, that takes into account the average quality of the voters and their average disagreement. We also propose an extensive PAC-Bayesian analysis that shows how the C-bound can be estimated from various observations contained in the training data. The analysis intends to be self-contained and can be used as introductory material to PAC-Bayesian statistical learning theory. It starts from a general PAC-Bayesian perspective and ends with uncommon PAC-Bayesian bounds. Some of these bounds contain no Kullback-Leibler divergence and others allow kernel functions to be used as voters (via the sample compression setting). Finally, out of the analysis, we propose the MinCq learning algorithm that basically minimizes the C-bound. MinCq reduces to a simple quadratic program. Aside from being theoretically grounded, MinCq achieves state-of-the-art performance, as shown in our extensive empirical comparison with both AdaBoost and the Support Vector Machine.

Drouin, Alexandre, Giguère, Sébastien, Déraspe, Maxime, Laviolette, François, Marchand, Mario, Corbeil, Jacques

The Set Covering Machine (SCM) is a greedy learning algorithm that produces sparse classifiers. We extend the SCM for datasets that contain a huge number of features. The whole genetic material of living organisms is an example of such a case, where the number of feature exceeds 10^7. Three human pathogens were used to evaluate the performance of the SCM at predicting antimicrobial resistance. Our results show that the SCM compares favorably in terms of sparsity and accuracy against L1 and L2 regularized Support Vector Machines and CART decision trees. Moreover, the SCM was the only algorithm that could consider the full feature space. For all other algorithms, the latter had to be filtered as a preprocessing step.

Ajakan, Hana, Germain, Pascal, Larochelle, Hugo, Laviolette, François, Marchand, Mario

We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).

Marchand, Mario, Su, Hongyu, Morvant, Emilie, Rousu, Juho, Shawe-Taylor, John S.

We show that the usual score function for conditional Markov networks can be written as the expectation over the scores of their spanning trees. We also show that a small random sample of these output trees can attain a significant fraction of the margin obtained by the complete graph and we provide conditions under which we can perform tractable inference. The experimental results confirm that practical learning is scalable to realistic datasets using this approach.

Drouin, Alexandre, Giguère, Sébastien, Sagatovich, Vladana, Déraspe, Maxime, Laviolette, François, Marchand, Mario, Corbeil, Jacques

The increased affordability of whole genome sequencing has motivated its use for phenotypic studies. We address the problem of learning interpretable models for discrete phenotypes from whole genomes. We propose a general approach that relies on the Set Covering Machine and a k-mer representation of the genomes. We show results for the problem of predicting the resistance of Pseudomonas Aeruginosa, an important human pathogen, against 4 antibiotics. Our results demonstrate that extremely sparse models which are biologically relevant can be learnt using this approach.