Goto

Collaborating Authors

 Learning Graphical Models


Scalable Latent Tree Model and its Application to Health Analytics

arXiv.org Machine Learning

Latent tree graphical models are a popular class of latent variable models, where a probability distribution involving observed and hidden variables are Markovian on a tree. Due to the fact that structure of (observable and hidden) variable interactions are approximated as a tree, inference on latent trees can be carried out exactly through a simple belief propagation [Pea88]. Therefore, latent tree graphical models present a good tradeoff between model accuracy and computational complexity. They are applicable in many domains, where it is natural to expect hierarchical or sequential relationships among the variables (through a hidden-Markov model). For instance, latent tree models have been employed for phylogenetic reconstruction [DEKM99], object recognition [CTW12a, CTW12b] and human pose estimation [WL13]. In this paper, we use latent tree model for discovering a hierarchy among diseases based on comorbidities exhibited in patients' health records, i.e. co-occurrences of diseases in patients. In particular, two large healthcare datasets of 30K and 1.6M patients are used to build the latent disease trees, where clinically meaningful disease clusters are identified as shown in fig 3 and 4. The task of learning a latent tree models consists of two parts: learning the tree structure, and learning the parameters of the tree. There exist many challenges which prohibit efficient or guaranteed learning of the latent tree graphical model, which will be addressed in this paper: 1. The location and the number of latent variables are hidden and the marginalized graph over the observable variables no longer conforms to a tree structure.


An Adaptive Online HDP-HMM for Segmentation and Classification of Sequential Data

arXiv.org Machine Learning

The joint problem of time segmentation and recognition of sequential data into meaningful subsequences has attracted significant research in a variety of domains. The ability to automatically segment and classify data is a core technology for applications like speaker diarisation, finance, activity understanding, multimedia annotation and human-computer interaction. To date, the main proposed solutions have included sliding windows [1], the hidden Markov model (HMM) [2], conditional random fields [3] [4], and structural SVM [5], covering the spectrum of generative, discriminative and maximum-margin dynamic classifiers. Along with advancements in learning and inference, research has witnessed increasingly realistic datasets which are bridging the gap between lab and real applications [6] [7]. Nevertheless, important challenges such as model adaptation and dynamic class sets remain unresolved. We address both these limitations by an adaptive online model that can accommodate an unlimited (theoretically infinite) number of classes. In a nutshell, this is achieved by applying a Bayesian nonparametric model, the hierarchical Dirichlet process (HDP), as the prior for a hidden Markov model (a model known as HDP-HMM [8] [9]), and exploiting an adaptive learning rate for model adaptation. The proposed model provides an adaptive online learning approach for joint segmentation and recognition of sequential data 1 with incremental class sets and we refer to it as ADON HDP-HMM in the following.


Qualitative inequalities for squared partial correlations of a Gaussian random vector

arXiv.org Machine Learning

We describe various sets of conditional independence relationships, sufficient for qualitatively comparing non-vanishing squared partial correlations of a Gaussian random vector. These sufficient conditions are satisfied by several graphical Markov models. Rules for comparing degree of association among the vertices of such Gaussian graphical models are also developed. We apply these rules to compare conditional dependencies on Gaussian trees. In particular for trees, we show that such dependence can be completely characterized by the length of the paths joining the dependent vertices to each other and to the vertices conditioned on. We also apply our results to postulate rules for model selection for polytree models. Our rules apply to mutual information of Gaussian random vectors as well.


Geometry and Expressive Power of Conditional Restricted Boltzmann Machines

arXiv.org Machine Learning

Conditional restricted Boltzmann machines are undirected stochastic neural networks with a layer of input and output units connected bipartitely to a layer of hidden units. These networks define models of conditional probability distributions on the states of the output units given the states of the input units, parametrized by interaction weights and biases. We address the representational power of these models, proving results their ability to represent conditional Markov random fields and conditional distributions with restricted supports, the minimal size of universal approximators, the maximal model approximation errors, and on the dimension of the set of representable conditional distributions. We contribute new tools for investigating conditional probability models, which allow us to improve the results that can be derived from existing work on restricted Boltzmann machine probability models.


Describing and Understanding Neighborhood Characteristics through Online Social Media

arXiv.org Machine Learning

Geotagged data can be used to describe regions in the world and discover local themes. However, not all data produced within a region is necessarily specifically descriptive of that area. To surface the content that is characteristic for a region, we present the geographical hierarchy model (GHM), a probabilistic model based on the assumption that data observed in a region is a random mixture of content that pertains to different levels of a hierarchy. We apply the GHM to a dataset of 8 million Flickr photos in order to discriminate between content (i.e., tags) that specifically characterizes a region (e.g., neighborhood) and content that characterizes surrounding areas or more general themes. Knowledge of the discriminative and non-discriminative terms used throughout the hierarchy enables us to quantify the uniqueness of a given region and to compare similar but distant regions. Our evaluation demonstrates that our model improves upon traditional Naive Bayes classification by 47% and hierarchical TF-IDF by 27%. We further highlight the differences and commonalities with human reasoning about what is locally characteristic for a neighborhood, distilled from ten interviews and a survey that covered themes such as time, events, and prior regional knowledge.


Switching to Learn

arXiv.org Machine Learning

Distributed estimation, detection, and learning theory in networks have attracted much attention over the past decades [1], [2], [3], [4], with applications that range from sensor and robotic networks [5], [6], [7], [8], [9] to social and economic networks [10], [11], [12]. In these scenarios, agents in a network need to learn the value of a parameter that they may not be able to infer on their own, but the global spread of information in the network provides them with adequate data to learn the truth collectively. As a result, agents iteratively exchange information with their neighbors. For instance, in distributed sensor and robotic networks, agents use local diffusion to augment their imperfect observations with information from their neighbors and achieve consensus and coordination [13], [14]. Similarly, agents exchange beliefs in social networks to benefit from each other's observations and private information and learn the unknown state of the world [15], [16]. Existing literature on distributed learning focuses mostly on environments where individuals communicate at every round. Of particular relevance to our discussion are a host of algorithms that follow the non-Bayesian learning scheme in Jadbabaie et.


Directed Information Graphs

arXiv.org Machine Learning

We propose a graphical model for representing networks of stochastic processes, the minimal generative model graph. It is based on reduced factorizations of the joint distribution over time. We show that under appropriate conditions, it is unique and consistent with another type of graphical model, the directed information graph, which is based on a generalization of Granger causality. We demonstrate how directed information quantifies Granger causality in a particular sequential prediction setting. We also develop efficient methods to estimate the topological structure from data that obviate estimating the joint statistics. One algorithm assumes upper-bounds on the degrees and uses the minimal dimension statistics necessary. In the event that the upper-bounds are not valid, the resulting graph is nonetheless an optimal approximation. Another algorithm uses near-minimal dimension statistics when no bounds are known but the distribution satisfies a certain criterion. Analogous to how structure learning algorithms for undirected graphical models use mutual information estimates, these algorithms use directed information estimates. We characterize the sample-complexity of two plug-in directed information estimators and obtain confidence intervals. For the setting when point estimates are unreliable, we propose an algorithm that uses confidence intervals to identify the best approximation that is robust to estimation error. Lastly, we demonstrate the effectiveness of the proposed algorithms through analysis of both synthetic data and real data from the Twitter network. In the latter case, we identify which news sources influence users in the network by merely analyzing tweet times.


L_1-regularized Boltzmann machine learning using majorizer minimization

arXiv.org Machine Learning

We propose an inference method to estimate sparse interactions and biases according to Boltzmann machine learning. The basis of this method is $L_1$ regularization, which is often used in compressed sensing, a technique for reconstructing sparse input signals from undersampled outputs. $L_1$ regularization impedes the simple application of the gradient method, which optimizes the cost function that leads to accurate estimations, owing to the cost function's lack of smoothness. In this study, we utilize the majorizer minimization method, which is a well-known technique implemented in optimization problems, to avoid the non-smoothness of the cost function. By using the majorizer minimization method, we elucidate essentially relevant biases and interactions from given data with seemingly strongly-correlated components.


Large-Scale Distributed Bayesian Matrix Factorization using Stochastic Gradient MCMC

arXiv.org Machine Learning

Despite having various attractive qualities such as high prediction accuracy and the ability to quantify uncertainty and avoid over-fitting, Bayesian Matrix Factorization has not been widely adopted because of the prohibitive cost of inference. In this paper, we propose a scalable distributed Bayesian matrix factorization algorithm using stochastic gradient MCMC. Our algorithm, based on Distributed Stochastic Gradient Langevin Dynamics, can not only match the prediction accuracy of standard MCMC methods like Gibbs sampling, but at the same time is as fast and simple as stochastic gradient descent. In our experiments, we show that our algorithm can achieve the same level of prediction accuracy as Gibbs sampling an order of magnitude faster. We also show that our method reduces the prediction error as fast as distributed stochastic gradient descent, achieving a 4.1% improvement in RMSE for the Netflix dataset and an 1.8% for the Yahoo music dataset.


Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data

arXiv.org Machine Learning

Multivariate categorical data occur in many applications of machine learning. One of the main difficulties with these vectors of categorical variables is sparsity. The number of possible observations grows exponentially with vector length, but dataset diversity might be poor in comparison. Recent models have gained significant improvement in supervised tasks with this data. These models embed observations in a continuous space to capture similarities between them. Building on these ideas we propose a Bayesian model for the unsupervised task of distribution estimation of multivariate categorical data. We model vectors of categorical variables as generated from a non-linear transformation of a continuous latent space. Non-linearity captures multi-modality in the distribution. The continuous representation addresses sparsity. Our model ties together many existing models, linking the linear categorical latent Gaussian model, the Gaussian process latent variable model, and Gaussian process classification. We derive inference for our model based on recent developments in sampling based variational inference. We show empirically that the model outperforms its linear and discrete counterparts in imputation tasks of sparse data.