Goto

Collaborating Authors

 Geiger, Bernhard C.


SeGMA: Semi-Supervised Gaussian Mixture Auto-Encoder

arXiv.org Artificial Intelligence

We propose a semi-supervised generative model, SeGMA, which learns a joint probability distribution of data and their classes and which is implemented in a typical Wasserstein auto-encoder framework. We choose a mixture of Gaussians as a target distribution in latent space, which provides a natural splitting of data into clusters. To connect Gaussian components with correct classes, we use a small amount of labeled data and a Gaussian classifier induced by the target distribution. SeGMA is optimized efficiently due to the use of Cramer-Wold distance as a maximum mean discrepancy penalty, which yields a closed-form expression for a mixture of spherical Gaussian components and thus obviates the need of sampling. While SeGMA preserves all properties of its semi-supervised predecessors and achieves at least as good generative performance on standard benchmark data sets, it presents additional features: (a) interpolation between any pair of points in the latent space produces realistically-looking samples; (b) combining the interpolation property with disentangled class and style variables, SeGMA is able to perform a continuous style transfer from one class to another; (c) it is possible to change the intensity of class characteristics in a data point by moving the latent representation of the data point away from specific Gaussian components.


Class-Conditional Compression and Disentanglement: Bridging the Gap between Neural Networks and Naive Bayes Classifiers

arXiv.org Machine Learning

In this draft, which reports on work in progress, we 1) adapt the information bottleneck functional by replacing the compression term by class-conditional compression, 2) relax this functional using a variational bound related to class-conditional disentanglement, 3) consider this functional as a training objective for stochastic neural networks, and 4) show that the latent representations are learned such that they can be used in a naive Bayes classifier. We continue by suggesting a series of experiments along the lines of Nonlinear Information Bottleneck [Kolchinsky et al., 2018], Deep Variational Information Bottleneck [Alemi et al., 2017], and Information Dropout [Achille and Soatto, 2018]. We furthermore suggest a neural network where the decoder architecture is a parameterized naive Bayes decoder. We consider a classification task with a feature random variable (RV) X on R and a class RV Y on the finite set Y of classes. We further consider stochastic feed-forward neural networks (NNs).


Understanding Individual Neuron Importance Using Information Theory

arXiv.org Machine Learning

In this work, we characterize the outputs of individual neurons in a trained feed-forward neural network by entropy, mutual information with the class variable, and a class selectivity measure based on Kullback-Leibler divergence. By cumulatively ablating neurons in the network, we connect these information-theoretic measures to the impact their removal has on classification performance on the test set. We observe that, looking at the neural network as a whole, none of these measures is a good indicator for classification performance, thus confirming recent results by Morcos et al. However, looking at specific layers separately, both mutual information and class selectivity are positively correlated with classification performance. We thus conclude that it is ill-advised to compare these measures across layers, and that different layers may be most appropriately characterized by different measures. We then discuss pruning neurons from neural networks to reduce computational complexity of inference. Drawing from our results, we perform pruning based on information-theoretic measures on a fully connected feed-forward neural network with two hidden layers trained on MNIST dataset and compare the results to a recently proposed pruning method. We furthermore show that the common practice of re-training after pruning can partly be obviated by a surgery step called bias balancing, without incurring significant performance degradation.


Semi-supervised cross-entropy clustering with information bottleneck constraint

arXiv.org Machine Learning

In this paper, we propose a semi-supervised clustering method, CEC-IB, that models data with a set of Gaussian distributions and that retrieves clusters based on a partial labeling provided by the user (partition-level side information). By combining the ideas from cross-entropy clustering (CEC) with those from the information bottleneck method (IB), our method trades between three conflicting goals: the accuracy with which the data set is modeled, the simplicity of the model, and the consistency of the clustering with side information. Experiments demonstrate that CEC-IB has a performance comparable to Gaussian mixture models (GMM) in a classical semi-supervised scenario, but is faster, more robust to noisy labels, automatically determines the optimal number of clusters, and performs well when not all classes are present in the side information. Moreover, in contrast to other semi-supervised models, it can be successfully applied in discovering natural subgroups if the partition-level side information is derived from the top levels of a hierarchical clustering.