Goto

Collaborating Authors

 Bayesian Learning


A Birth and Death Process for Bayesian Network Structure Inference

arXiv.org Machine Learning

Bayesian networks (Pearl [13]) are convenient graphical expressions for high dimensional probability distributions representing complex relationships between a large number of random variables. A Bayesian network is a directed acyclic graph consisting of nodes which represent random variables and arrows which correspond to probabilistic dependencies between them. There has been a great deal of interest in recent years on the NPhard problem of learning the structure (placement of directed edges) of Bayesian networks from data ([1],[2],[4],[5], [6],[8],[9],[11],[12]). Much of this has been driven by the study of genetic regulatory networks in molecular biology due to advances in technology and, specifically, microarray techniques that allow scientists to rapidly measure expression levels of genes in cells. As an integral part of machine learning, Bayesian networks have also been used for pattern recognition, language processing including speech recognition, and credit risk analysis. Structure learning typically involves defining a network score function and is then, in theory, a straightforward optimization problem.


On Identification of Sparse Multivariable ARX Model: A Sparse Bayesian Learning Approach

arXiv.org Machine Learning

This paper begins with considering the identification of sparse linear time-invariant networks described by multivariable ARX models. Such models possess relatively simple structure thus used as a benchmark to promote further research. With identifiability of the network guaranteed, this paper presents an identification method that infers both the Boolean structure of the network and the internal dynamics between nodes. Identification is performed directly from data without any prior knowledge of the system, including its order. The proposed method solves the identification problem using Maximum a posteriori estimation (MAP) but with inseparable penalties for complexity, both in terms of element (order of nonzero connections) and group sparsity (network topology). Such an approach is widely applied in Compressive Sensing (CS) and known as Sparse Bayesian Learning (SBL). We then propose a novel scheme that combines sparse Bayesian and group sparse Bayesian to efficiently solve the problem. The resulted algorithm has a similar form of the standard Sparse Group Lasso (SGL) while with known noise variance, it simplifies to exact re-weighted SGL. The method and the developed toolbox can be applied to infer networks from a wide range of fields, including systems biology applications such as signaling and genetic regulatory networks.


Network structure, metadata and the prediction of missing nodes and annotations

arXiv.org Machine Learning

The empirical validation of community detection methods is often based on available annotations on the nodes that serve as putative indicators of the large-scale network structure. Most often, the suitability of the annotations as topological descriptors itself is not assessed, and without this it is not possible to ultimately distinguish between actual shortcomings of the community detection algorithms on one hand, and the incompleteness, inaccuracy or structured nature of the data annotations themselves on the other. In this work we present a principled method to access both aspects simultaneously. We construct a joint generative model for the data and metadata, and a nonparametric Bayesian framework to infer its parameters from annotated datasets. We assess the quality of the metadata not according to its direct alignment with the network communities, but rather in its capacity to predict the placement of edges in the network. We also show how this feature can be used to predict the connections to missing nodes when only the metadata is available, as well as missing metadata. By investigating a wide range of datasets, we show that while there are seldom exact agreements between metadata tokens and the inferred data groups, the metadata is often informative of the network structure nevertheless, and can improve the prediction of missing nodes. This shows that the method uncovers meaningful patterns in both the data and metadata, without requiring or expecting a perfect agreement between the two.


Maximum Entropy Vector Kernels for MIMO system identification

arXiv.org Machine Learning

Recent contributions have framed linear system identification as a nonparametric regularized inverse problem. Relying on $\ell_2$-type regularization which accounts for the stability and smoothness of the impulse response to be estimated, these approaches have been shown to be competitive w.r.t classical parametric methods. In this paper, adopting Maximum Entropy arguments, we derive a new $\ell_2$ penalty deriving from a vector-valued kernel; to do so we exploit the structure of the Hankel matrix, thus controlling at the same time complexity, measured by the McMillan degree, stability and smoothness of the identified models. As a special case we recover the nuclear norm penalty on the squared block Hankel matrix. In contrast with previous literature on reweighted nuclear norm penalties, our kernel is described by a small number of hyper-parameters, which are iteratively updated through marginal likelihood maximization; constraining the structure of the kernel acts as a (hyper)regularizer which helps controlling the effective degrees of freedom of our estimator. To optimize the marginal likelihood we adapt a Scaled Gradient Projection (SGP) algorithm which is proved to be significantly computationally cheaper than other first and second order off-the-shelf optimization methods. The paper also contains an extensive comparison with many state-of-the-art methods on several Monte-Carlo studies, which confirms the effectiveness of our procedure.


How to evaluate Data Science models ?

@machinelearnbot

In today's Digital age, insights received from data science are extremely important to deliver the best customer experience. Data Scientists use various techniques such as Regression, SVM, Neural network, Nearest neighbor, Naive Bayes, Decision Tree and Ensemble models. These algorithms help to identify previously unrecognized patterns and trends hidden within vast amounts of structured and unstructured information. These patterns are used to create predictive models that try to forecast future behavior. These models have many practical business applications: predicting patients at risk, they help banks decide which customers to approve for loans, and marketers use them to determine which leads to target with campaigns. But how to determine if the predictive models you create are accurate, meaningful representations that will prove valuable to your organization?


Whole-brain substitute CT generation using Markov random field mixture models

arXiv.org Machine Learning

Computed tomography (CT) equivalent information is needed for attenuation correction in PET imaging and for dose planning in radiotherapy. Prior work has shown that Gaussian mixture models can be used to generate a substitute CT (s-CT) image from a specific set of MRI modalities. This work introduces a more flexible class of mixture models for s-CT generation, that incorporates spatial dependency in the data through a Markov random field prior on the latent field of class memberships associated with a mixture model. Furthermore, the mixture distributions are extended from Gaussian to normal inverse Gaussian (NIG), allowing heavier tails and skewness. The amount of data needed to train a model for s-CT generation is of the order of 100 million voxels. The computational efficiency of the parameter estimation and prediction methods are hence paramount, especially when spatial dependency is included in the models. A stochastic Expectation Maximization (EM) gradient algorithm is proposed in order to tackle this challenge. The advantages of the spatial model and NIG distributions are evaluated with a cross-validation study based on data from 14 patients. The study show that the proposed model enhances the predictive quality of the s-CT images by reducing the mean absolute error with 17.9%. Also, the distribution of CT values conditioned on the MR images are better explained by the proposed model as evaluated using continuous ranked probability scores.


Predictive Coarse-Graining

arXiv.org Machine Learning

We propose a data-driven, coarse-graining formulation in the context of equilibrium statistical mechanics. In contrast to existing techniques which are based on a fine-to-coarse map, we adopt the opposite strategy by prescribing a probabilistic coarse-to-fine map. This corresponds to a directed probabilistic model where the coarse variables play the role of latent generators of the fine scale (all-atom) data. From an information-theoretic perspective, the framework proposed provides an improvement upon the relative entropy method and is capable of quantifying the uncertainty due to the information loss that unavoidably takes place during the CG process. Furthermore, it can be readily extended to a fully Bayesian model where various sources of uncertainties are reflected in the posterior of the model parameters. The latter can be used to produce not only point estimates of fine-scale reconstructions or macroscopic observables, but more importantly, predictive posterior distributions on these quantities. Predictive posterior distributions reflect the confidence of the model as a function of the amount of data and the level of coarse-graining. The issues of model complexity and model selection are seamlessly addressed by employing a hierarchical prior that favors the discovery of sparse solutions, revealing the most prominent features in the coarse-grained model. A flexible and parallelizable Monte Carlo - Expectation-Maximization (MC-EM) scheme is proposed for carrying out inference and learning tasks. A comparative assessment of the proposed methodology is presented for a lattice spin system and the SPC/E water model.


Machine Learning: Filtering Email for Spam or Ham - Code School Blog

#artificialintelligence

You may have seen our previous posts on machine learning -- specifically, how to let your code learn from text and working with stop words, stemming, and spam. So today, we're going to build our machine learning-based spam filter, using the tools we walked through in those posts: tokenizer, stemmer, and naive bayes classifier. We are going to work with bluebird promise library here, so if you are not used to promises, please take a look at the bluebird API reference. Before we begin, it's important to have good training data. You can download some here -- we are interested in two.


Online Categorical Subspace Learning for Sketching Big Data with Misses

arXiv.org Machine Learning

With the scale of data growing every day, reducing the dimensionality (a.k.a. sketching) of high-dimensional data has emerged as a task of paramount importance. Relevant issues to address in this context include the sheer volume of data that may consist of categorical samples, the typically streaming format of acquisition, and the possibly missing entries. To cope with these challenges, the present paper develops a novel categorical subspace learning approach to unravel the latent structure for three prominent categorical (bilinear) models, namely, Probit, Tobit, and Logit. The deterministic Probit and Tobit models treat data as quantized values of an analog-valued process lying in a low-dimensional subspace, while the probabilistic Logit model relies on low dimensionality of the data log-likelihood ratios. Leveraging the low intrinsic dimensionality of the sought models, a rank regularized maximum-likelihood estimator is devised, which is then solved recursively via alternating majorization-minimization to sketch high-dimensional categorical data `on the fly.' The resultant procedure alternates between sketching the new incomplete datum and refining the latent subspace, leading to lightweight first-order algorithms with highly parallelizable tasks per iteration. As an extra degree of freedom, the quantization thresholds are also learned jointly along with the subspace to enhance the predictive power of the sought models. Performance of the subspace iterates is analyzed for both infinite and finite data streams, where for the former asymptotic convergence to the stationary point set of the batch estimator is established, while for the latter sublinear regret bounds are derived for the empirical cost. Simulated tests with both synthetic and real-world datasets corroborate the merits of the novel schemes for real-time movie recommendation and chess-game classification.


Orthogonal parallel MCMC methods for sampling and optimization

arXiv.org Machine Learning

Monte Carlo (MC) methods are widely used for Bayesian inference and optimization in statistics, signal processing and machine learning. A well-known class of MC methods are Markov Chain Monte Carlo (MCMC) algorithms. In order to foster better exploration of the state space, specially in high-dimensional applications, several schemes employing multiple parallel MCMC chains have been recently introduced. In this work, we describe a novel parallel interacting MCMC scheme, called {\it orthogonal MCMC} (O-MCMC), where a set of "vertical" parallel MCMC chains share information using some "horizontal" MCMC techniques working on the entire population of current states. More specifically, the vertical chains are led by random-walk proposals, whereas the horizontal MCMC techniques employ independent proposals, thus allowing an efficient combination of global exploration and local approximation. The interaction is contained in these horizontal iterations. Within the analysis of different implementations of O-MCMC, novel schemes in order to reduce the overall computational cost of parallel multiple try Metropolis (MTM) chains are also presented. Furthermore, a modified version of O-MCMC for optimization is provided by considering parallel simulated annealing (SA) algorithms. Numerical results show the advantages of the proposed sampling scheme in terms of efficiency in the estimation, as well as robustness in terms of independence with respect to initial values and the choice of the parameters.