Bayesian Learning
Entropy-Constrained Training of Deep Neural Networks
Wiedemann, Simon, Marban, Arturo, Mรผller, Klaus-Robert, Samek, Wojciech
Abstract--We propose a general framework for neural network compression that is motivated by the Minimum Description Length (MDL) principle. For that we first derive an expression forthe entropy of a neural network, which measures its complexity explicitly in terms of its bit-size. This objective generalizes many of the compression techniques proposed in the literature, in that pruning or reducing the cardinality of the weight elements of the network can be seen special cases of entropy-minimization techniques. Furthermore, we derive a continuous relaxation of the objective, which allows us to minimize it using gradient based optimization techniques. Finally, we show that we can reach stateof-the-art compressionresults on different network architectures and data sets, e.g. I. INTRODUCTION It is well established that deep neural networks excel on a wide range of machine learning tasks [1].
On The Chain Rule Optimal Transport Distance
We define a novel class of distances between statistical multivariate distributions by solving an optimal transportation problem on their marginal densities with respect to a ground distance defined on their conditional densities. By using the chain rule factorization of probabilities, we show how to perform optimal transport on a ground space being an information-geometric manifold of conditional probabilities. We prove that this new distance is a metric whenever the chosen ground distance is a metric. Our distance generalizes both the Wasserstein distances between point sets and a recently introduced metric distance between statistical mixtures. As a first application of this Chain Rule Optimal Transport (CROT) distance, we show that the ground distance between statistical mixtures is upper bounded by this optimal transport distance, whenever the ground distance is joint convex. We report on our experiments which quantify the tightness of the CROT distance for the total variation distance and a square root generalization of the Jensen-Shannon divergence between mixtures.
MR image reconstruction using deep density priors
Tezcan, Kerem C., Baumgartner, Christian F., Luechinger, Roger, Pruessmann, Klaas P., Konukoglu, Ender
Algorithms for Magnetic Resonance (MR) image reconstruction from undersampled measurements exploit prior information to compensate for missing k-space data. Deep learning (DL) provides a powerful framework for extracting such information from existing image datasets, through learning, and then using it for reconstruction. Leveraging this, recent methods employed DL to learn mappings from undersampled to fully sampled images using paired datasets, including undersampled and corresponding fully sampled images, integrating prior knowledge implicitly. In this article, we propose an alternative approach that learns the probability distribution of fully sampled MR images using unsupervised DL, specifically Variational Autoencoders (VAE), and use this as an explicit prior term in reconstruction, completely decoupling the encoding operation from the prior. The resulting reconstruction algorithm enjoys a powerful image prior to compensate for missing k-space data without requiring paired datasets for training nor being prone to associated sensitivities, such as deviations in undersampling patterns used in training and test time or coil settings. We evaluated the proposed method with T1 weighted images from a publicly available dataset, multi-coil complex images acquired from healthy volunteers (N=8) and images with white matter lesions. The proposed algorithm, using the VAE prior, produced visually high quality reconstructions and achieved low RMSE values, outperforming most of the alternative methods on the same dataset. On multi-coil complex data, the algorithm yielded accurate magnitude and phase reconstruction results. In the experiments on images with white matter lesions, the method faithfully reconstructed the lesions. Keywords: Reconstruction, MRI, prior probability, machine learning, deep learning, unsupervised learning, density estimation
Disentangling group and link persistence in Dynamic Stochastic Block models
Barucca, Paolo, Lillo, Fabrizio, Mazzarisi, Piero, Tantari, Daniele
We study the inference of a model of dynamic networks in which both communities and links keep memory of previous network states. By considering maximum likelihood inference from single snapshot observations of the network, we show that link persistence makes the inference of communities harder, decreasing the detectability threshold, while community persistence tends to make it easier. We analytically show that communities inferred from single network snapshot can share a maximum overlap with the underlying communities of a specific previous instant in time. This leads to time-lagged inference: the identification of past communities rather than present ones. Finally we compute the time lag and propose a corrected algorithm, the Lagged Snapshot Dynamic (LSD) algorithm, for community detection in dynamic networks. We analytically and numerically characterize the detectability transitions of such algorithm as a function of the memory parameters of the model and we make a comparison with a full dynamic inference.
Clustering-Oriented Representation Learning with Attractive-Repulsive Loss
Kenyon-Dean, Kian, Cianflone, Andre, Page-Caccia, Lucas, Rabusseau, Guillaume, Cheung, Jackie Chi Kit, Precup, Doina
The standard loss function used to train neural network classifiers, categorical cross-entropy (CCE), seeks to maximize accuracy on the training data; building useful representations is not a necessary byproduct of this objective. In this work, we propose clustering-oriented representation learning (COREL) as an alternative to CCE in the context of a generalized attractive-repulsive loss framework. COREL has the consequence of building latent representations that collectively exhibit the quality of natural clustering within the latent space of the final hidden layer, according to a predefined similarity function. Despite being simple to implement, COREL variants outperform or perform equivalently to CCE in a variety of scenarios, including image and news article classification using both feed-forward and convolutional neural networks. Analysis of the latent spaces created with different similarity functions facilitates insights on the different use cases COREL variants can satisfy, where the Cosine-COREL variant makes a consistently clusterable latent space, while Gaussian-COREL consistently obtains better classification accuracy than CCE.
A geometric characterisation of sensitivity analysis in monomial models
Leonelli, Manuele, Riccomagno, Eva
Sensitivity analysis in probabilistic discrete graphical models is usually conducted by varying one probability value at a time and observing how this affects output probabilities of interest. When one probability is varied then others are proportionally covaried to respect the sum-to-one condition of probability laws. The choice of proportional covariation is justified by a variety of optimality conditions, under which the original and the varied distributions are as close as possible under different measures of closeness. For variations of more than one parameter at a time proportional covariation is justified in some special cases only. In this work, for the large class of discrete statistical models entertaining a regular monomial parametrisation, we demonstrate the optimality of newly defined proportional multi-way schemes with respect to an optimality criterion based on the notion of I-divergence. We demonstrate that there are varying parameters choices for which proportional covariation is not optimal and identify the sub-family of model distributions where the distance between the original distribution and the one where probabilities are covaried proportionally is minimum. This is shown by adopting a new formal, geometric characterization of sensitivity analysis in monomial models, which include a wide array of probabilistic graphical models. We also demonstrate the optimality of proportional covariation for multi-way analyses in Naive Bayes classifiers.
Machine Learning for Molecular Dynamics on Long Timescales
Molecular Dynamics (MD) simulation is widely used to analyze the properties of molecules and materials. Most practical applications, such as comparison with experimental measurements, designing drug molecules, or optimizing materials, rely on statistical quantities, which may be prohibitively expensive to compute from direct long-time MD simulations. Classical Machine Learning (ML) techniques have already had a profound impact on the field, especially for learning low-dimensional models of the long-time dynamics and for devising more efficient sampling schemes for computing long-time statistics. Novel ML methods have the potential to revolutionize long-timescale MD and to obtain interpretable models. ML concepts such as statistical estimator theory, end-to-end learning, representation learning and active learning are highly interesting for the MD researcher and will help to develop new solutions to hard MD problems. With the aim of better connecting the MD and ML research areas and spawning new research on this interface, we define the learning problems in long-timescale MD, present successful approaches and outline some of the unsolved ML problems in this application field.
Automatic Detection of Reflective Thinking in Mathematical Problem Solving based on Unconstrained Bodily Exploration
Olugbade, Temitayo A., Newbold, Joseph, Johnson, Rose, Volta, Erica, Alborno, Paolo, Niewiadomski, Radoslaw, Dillon, Max, Volpe, Gualtiero, Bianchi-Berthouze, Nadia
For technology (like serious games) that aims to deliver interactive learning, it is important to address relevant mental experiences such as reflective thinking during problem solving. To facilitate research in this direction, we present the weDraw-1 Movement Dataset of body movement sensor data and reflective thinking labels for 26 children solving mathematical problems in unconstrained settings where the body (full or parts) was required to explore these problems. Further, we provide qualitative analysis of behaviours that observers used in identifying reflective thinking moments in these sessions. The body movement cues from our compilation informed features that lead to average F1 score of 0.73 for automatic detection of reflective thinking based on Long Short-Term Memory neural networks. We further obtained 0.79 average F1 score for end-to-end detection of reflective thinking periods, i.e. based on raw sensor data. Finally, the algorithms resulted in 0.64 average F1 score for period subsegments as short as 4 seconds. Overall, our results show the possibility of detecting reflective thinking moments from body movement behaviours of a child exploring mathematical concepts bodily, such as within serious game play.
Adams Conditioning and Likelihood Ratio Transfer Mediated Inference
Bayesian inference as applied in a legal setting is about belief transfer and involves a plurality of agents and communication protocols. A forensic expert (FE) may communicate to a trier of fact (TOF) first its value of a certain likelihood ratio with respect to FE's belief state as represented by a probability function on FE's proposition space. Subsequently FE communicates its recently acquired confirmation that a certain evidence proposition is true. Then TOF performs likelihood ratio transfer mediated reasoning thereby revising their own belief state. The logical principles involved in likelihood transfer mediated reasoning are discussed in a setting where probabilistic arithmetic is done within a meadow, and with Adams conditioning placed in a central role.
Machine Learning: Arthur Samuel, Artificial Intelligence (AI) & Big Data
Programmed by Arthur Samuel, this big data discipline of artificial intelligence replaces the tedious task of trying to understand the problem well enough to be able to write a program, which can take much longer or be virtually impossible. Techopedia defines the discipline of machine learning as "an artificial intelligence (AI) discipline geared toward the technological development of human knowledge. Machine learning allows computers to handle new situations via analysis, self-training, observation and experience. Machine learning facilitates the continuous advancement of computing through exposure to new scenarios, testing and adaptation, while employing pattern and trend detection for improved decisions in subsequent (though not identical) situations." In 1959, IBM employee Arthur Samuel wanted to teach a computer to play checkers. So, he wrote the original program on IBM's first commercial computer, the IBM 701, but he kept winning.