Learning Graphical Models
Variational Autoencoders for Feature Detection of Magnetic Resonance Imaging Data
Hjelm, R. Devon, Plis, Sergey M., Calhoun, Vince C.
Independent component analysis (ICA), as an approach to the blind source-separation (BSS) problem, has become the de-facto standard in many medical imaging settings. Despite successes and a large ongoing research effort, the limitation of ICA to square linear transformations have not been overcome, so that general INFOMAX is still far from being realized. As an alternative, we present feature analysis in medical imaging as a problem solved by Helmholtz machines, which include dimensionality reduction and reconstruction of the raw data under the same objective, and which recently have overcome major difficulties in inference and learning with deep and nonlinear configurations. We demonstrate one approach to training Helmholtz machines, variational auto-encoders (VAE), as a viable approach toward feature extraction with magnetic resonance imaging (MRI) data.
How Robust are Reconstruction Thresholds for Community Detection?
Moitra, Ankur, Perry, William, Wein, Alexander S.
The stochastic block model is one of the oldest and most ubiquitous models for studying clustering and community detection. In an exciting sequence of developments, motivated by deep but non-rigorous ideas from statistical physics, Decelle et al. conjectured a sharp threshold for when community detection is possible in the sparse regime. Mossel, Neeman and Sly and Massoulie proved the conjecture and gave matching algorithms and lower bounds. Here we revisit the stochastic block model from the perspective of semirandom models where we allow an adversary to make `helpful' changes that strengthen ties within each community and break ties between them. We show a surprising result that these `helpful' changes can shift the information-theoretic threshold, making the community detection problem strictly harder. We complement this by showing that an algorithm based on semidefinite programming (which was known to get close to the threshold) continues to work in the semirandom model (even for partial recovery). This suggests that algorithms based on semidefinite programming are robust in ways that any algorithm meeting the information-theoretic threshold cannot be. These results point to an interesting new direction: Can we find robust, semirandom analogues to some of the classical, average-case thresholds in statistics? We also explore this question in the broadcast tree model, and we show that the viewpoint of semirandom models can help explain why some algorithms are preferred to others in practice, in spite of the gaps in their statistical performance on random models.
Phase transitions and sample complexity in Bayes-optimal matrix factorization
Kabashima, Yoshiyuki, Krzakala, Florent, Mézard, Marc, Sakata, Ayaka, Zdeborová, Lenka
We analyse the matrix factorization problem. Given a noisy measurement of a product of two matrices, the problem is to estimate back the original matrices. It arises in many applications such as dictionary learning, blind matrix calibration, sparse principal component analysis, blind source separation, low rank matrix completion, robust principal component analysis or factor analysis. It is also important in machine learning: unsupervised representation learning can often be studied through matrix factorization. We use the tools of statistical mechanics - the cavity and replica methods - to analyze the achievability and computational tractability of the inference problems in the setting of Bayes-optimal inference, which amounts to assuming that the two matrices have random independent elements generated from some known distribution, and this information is available to the inference algorithm. In this setting, we compute the minimal mean-squared-error achievable in principle in any computational time, and the error that can be achieved by an efficient approximate message passing algorithm. The computation is based on the asymptotic state-evolution analysis of the algorithm. The performance that our analysis predicts, both in terms of the achieved mean-squared-error, and in terms of sample complexity, is extremely promising and motivating for a further development of the algorithm.
Markov Chains explained visually
Markov chains, named after Andrey Markov, are mathematical systems that hop from one "state" (a situation or set of values) to another. For example, if you made a Markov chain model of a baby's behavior, you might include "playing," "eating", "sleeping," and "crying" as states, which together with other behaviors could form a'state space': a list of all possible states. In addition, on top of the state space, a Markov chain tells you the probabilitiy of hopping, or "transitioning," from one state to any other state---e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first. With two states (A and B) in our state space, there are 4 possible transitions (not 2, because a state can transition back into itself). In this two state diagram, the probability of transitioning from any state to any other state is 0.5.
The Computational Power of Dynamic Bayesian Networks
Bayesian networks are probabilistic graphical models that represent a set of random variables and their conditional dependencies via a directed acyclic graph. Explicitly modeling the conditional dependencies between random variables permit efficient algorithms to perform inference and learning in the network. Causal Bayesian networks have the additional requirement that all edges in the network model a causal relationship. Dynamic Bayesian networks are the time-generalization of Bayesian networks and relate variables to each other over adjacent time steps. Dynamic Bayesian networks unify and extend a number of state-space models including hidden Markov models, hierarchical hidden Markov models and Kalman filters. Dynamic Bayesian networks can also be seen as the natural extension of acyclic causal models to models that permit cyclic causal relationships, while avoiding problems with causal models that try to model temporal relationships with an atemporal description [1]. A natural question is what is the expressive power of such networks. The result in this paper shows that although discrete dynamic Bayesian networks are sub-Turing in computational power, introducing continuous random variables with discrete children is sufficient to model Turing-complete computation.
A Probabilistic Machine Learning Approach to Detect Industrial Plant Faults
Fault detection in industrial plants is a hot research area as more and more sensor data are being collected throughout the industrial process. Automatic data-driven approaches are widely needed and seen as a promising area of investment. This paper proposes an effective machine learning algorithm to predict industrial plant faults based on classification methods such as penalized logistic regression, random forest and gradient boosted tree. A fault's start time and end time are predicted sequentially in two steps by formulating the original prediction problems as classification problems. The algorithms described in this paper won first place in the Prognostics and Health Management Society 2015 Data Challenge.
New Optimisation Methods for Machine Learning
A thesis submitted for the degree of Doctor of Philosophy of The Australian National University. In this work we introduce several new optimisation methods for problems in machine learning. Our algorithms broadly fall into two categories: optimisation of finite sums and of graph structured objectives. The finite sum problem is simply the minimisation of objective functions that are naturally expressed as a summation over a large number of terms, where each term has a similar or identical weight. Such objectives most often appear in machine learning in the empirical risk minimisation framework in the non-online learning setting. The second category, that of graph structured objectives, consists of objectives that result from applying maximum likelihood to Markov random field models. Unlike the finite sum case, all the non-linearity is contained within a partition function term, which does not readily decompose into a summation. For the finite sum problem, we introduce the Finito and SAGA algorithms, as well as variants of each. For graph-structured problems, we take three complementary approaches. We look at learning the parameters for a fixed structure, learning the structure independently, and learning both simultaneously. Specifically, for the combined approach, we introduce a new method for encouraging graph structures with the "scale-free" property. For the structure learning problem, we establish SHORTCUT, a O(n^{2.5}) expected time approximate structure learning method for Gaussian graphical models. For problems where the structure is known but the parameters unknown, we introduce an approximate maximum likelihood learning algorithm that is capable of learning a useful subclass of Gaussian graphical models.
Classification and Reconstruction of High-Dimensional Signals from Low-Dimensional Features in the Presence of Side Information
Renna, Francesco, Wang, Liming, Yuan, Xin, Yang, Jianbo, Reeves, Galen, Calderbank, Robert, Carin, Lawrence, Rodrigues, Miguel R. D.
This paper offers a characterization of fundamental limits on the classification and reconstruction of high-dimensional signals from low-dimensional features, in the presence of side information. We consider a scenario where a decoder has access both to linear features of the signal of interest and to linear features of the side information signal; while the side information may be in a compressed form, the objective is recovery or classification of the primary signal, not the side information. The signal of interest and the side information are each assumed to have (distinct) latent discrete labels; conditioned on these two labels, the signal of interest and side information are drawn from a multivariate Gaussian distribution. With joint probabilities on the latent labels, the overall signal-(side information) representation is defined by a Gaussian mixture model. We then provide sharp sufficient and/or necessary conditions for these quantities to approach zero when the covariance matrices of the Gaussians are nearly low-rank. These conditions, which are reminiscent of the well-known Slepian-Wolf and Wyner-Ziv conditions, are a function of the number of linear features extracted from the signal of interest, the number of linear features extracted from the side information signal, and the geometry of these signals and their interplay. Moreover, on assuming that the signal of interest and the side information obey such an approximately low-rank model, we derive expansions of the reconstruction error as a function of the deviation from an exactly low-rank model; such expansions also allow identification of operational regimes where the impact of side information on signal reconstruction is most relevant. Our framework, which offers a principled mechanism to integrate side information in high-dimensional data problems, is also tested in the context of imaging applications.
Mean-Field Inference in Gaussian Restricted Boltzmann Machine
Takahashi, Chako, Yasuda, Muneki
A Gaussian restricted Boltzmann machine (GRBM) is a Boltzmann machine defined on a bipartite graph and is an extension of usual restricted Boltzmann machines. A GRBM consists of two different layers: a visible layer composed of continuous visible variables and a hidden layer composed of discrete hidden variables. In this paper, we derive two different inference algorithms for GRBMs based on the naive mean-field approximation (NMFA). One is an inference algorithm for whole variables in a GRBM, and the other is an inference algorithm for partial variables in a GBRBM. We compare the two methods analytically and numerically and show that the latter method is better.
Towards An Architecture for Representation, Reasoning and Learning in Human-Robot Collaboration
Sridharan, Mohan (The University of Auckland)
Robots collaborating with humans need to represent knowledge, reason, and learn, at the sensorimotor level and the cognitive level. This paper summarizes the capabilities of an architecture that combines the comple- mentary strengths of declarative programming, proba- bilistic graphical models, and reinforcement learning, to represent, reason with, and learn from, qualitative and quantitative descriptions of incomplete domain knowledge and uncertainty. Representation and reasoning is based on two tightly-coupled domain representations at different resolutions. For any given task, the coarse- resolution symbolic domain representation is translated to an Answer Set Prolog program, which is solved to provide a tentative plan of abstract actions, and to explain unexpected outcomes. Each abstract action is implemented by translating the relevant subset of the corresponding fine-resolution probabilistic representation to a partially observable Markov decision process (POMDP). Any high probability beliefs, obtained by the execution of actions based on the POMDP policy, update the coarse-resolution representation. When incomplete knowledge of the rules governing the domain dynamics results in plan execution not achieving the desired goal, the coarse-resolution and fine-resolution representations are used to formulate the task of incrementally and interactively discovering these rules as a reinforcement learning problem. These capabilities are illustrated in the context of a mobile robot deployed in an indoor office domain.