Directed Networks
Parameterized Complexity Results for Exact Bayesian Network Structure Learning
Bayesian network structure learning is the notoriously difficult problem of discovering a Bayesian network that optimally represents a given set of training data. In this paper we study the computational worst-case complexity of exact Bayesian network structure learning under graph theoretic restrictions on the (directed) super-structure. The super-structure is an undirected graph that contains as subgraphs the skeletons of solution networks. We introduce the directed super-structure as a natural generalization of its undirected counterpart. Our results apply to several variants of score-based Bayesian network structure learning where the score of a network decomposes into local scores of its nodes. Results: We show that exact Bayesian network structure learning can be carried out in non-uniform polynomial time if the super-structure has bounded treewidth, and in linear time if in addition the super-structure has bounded maximum degree. Furthermore, we show that if the directed super-structure is acyclic, then exact Bayesian network structure learning can be carried out in quadratic time. We complement these positive results with a number of hardness results. We show that both restrictions (treewidth and degree) are essential and cannot be dropped without loosing uniform polynomial time tractability (subject to a complexity-theoretic assumption). Similarly, exact Bayesian network structure learning remains NP-hard for "almost acyclic" directed super-structures. Furthermore, we show that the restrictions remain essential if we do not search for a globally optimal network but aim to improve a given network by means of at most k arc additions, arc deletions, or arc reversals (k-neighborhood local search).
Impulsive Noise Mitigation in Powerline Communications Using Sparse Bayesian Learning
Lin, Jing, Nassar, Marcel, Evans, Brian L.
Additive asynchronous and cyclostationary impulsive noise limits communication performance in OFDM powerline communication (PLC) systems. Conventional OFDM receivers assume additive white Gaussian noise and hence experience degradation in communication performance in impulsive noise. Alternate designs assume a parametric statistical model of impulsive noise and use the model parameters in mitigating impulsive noise. These receivers require overhead in training and parameter estimation, and degrade due to model and parameter mismatch, especially in highly dynamic environments. In this paper, we model impulsive noise as a sparse vector in the time domain without any other assumptions, and apply sparse Bayesian learning methods for estimation and mitigation without training. We propose three iterative algorithms with different complexity vs. performance trade-offs: (1) we utilize the noise projection onto null and pilot tones to estimate and subtract the noise impulses; (2) we add the information in the data tones to perform joint noise estimation and OFDM detection; (3) we embed our algorithm into a decision feedback structure to further enhance the performance of coded systems. When compared to conventional OFDM PLC receivers, the proposed receivers achieve SNR gains of up to 9 dB in coded and 10 dB in uncoded systems in the presence of impulsive noise.
Continuous-time Infinite Dynamic Topic Models
Topic models are probabilistic models for discovering topical themes in collections of documents. In real world applications, these models provide us with the means of organizing what would otherwise be unstructured collections. They can help us cluster a huge collection into different topics or find a subset of the collection that resembles the topical theme found in an article at hand. The first wave of topic models developed were able to discover the prevailing topics in a big collection of documents spanning a period of time. It was later realized that these time-invariant models were not capable of modeling 1) the time varying number of topics they discover and 2) the time changing structure of these topics. Few models were developed to address this two deficiencies. The online-hierarchical Dirichlet process models the documents with a time varying number of topics. It varies the structure of the topics over time as well. However, it relies on document order, not timestamps to evolve the model over time. The continuous-time dynamic topic model evolves topic structure in continuous-time. However, it uses a fixed number of topics over time. In this dissertation, I present a model, the continuous-time infinite dynamic topic model, that combines the advantages of these two models 1) the online-hierarchical Dirichlet process, and 2) the continuous-time dynamic topic model. More specifically, the model I present is a probabilistic topic model that does the following: 1) it changes the number of topics over continuous time, and 2) it changes the topic structure over continuous-time. I compared the model I developed with the two other models with different setting values. The results obtained were favorable to my model and showed the need for having a model that has a continuous-time varying number of topics and topic structure.
Bayesian Consensus Clustering
Lock, Eric F., Dunson, David B.
The task of clustering a set of objects based on multiple sources of data arises in several modern applications. We propose an integrative statistical model that permits a separate clustering of the objects for each data source. These separate clusterings adhere loosely to an overall consensus clustering, and hence they are not independent. We describe a computationally scalable Bayesian framework for simultaneous estimation of both the consensus clustering and the source-specific clusterings. We demonstrate that this flexible approach is more robust than joint clustering of all data sources, and is more powerful than clustering each data source separately. This work is motivated by the integrated analysis of heterogeneous biomedical data, and we present an application to subtype identification of breast cancer tumor samples using publicly available data from The Cancer Genome Atlas. Several fields of research now analyze multi-source data (also called multimodal data), in which multiple heterogeneous datasets describe a common set of objects.
A probabilistic methodology for multilabel classification
Romero, Alfonso E., de Campos, Luis M.
Multilabel classification is a relatively recent subfield of machine learning. Unlike to the classical approach, where instances are labeled with only one category, in multilabel classification, an arbitrary number of categories is chosen to label an instance. Due to the problem complexity (the solution is one among an exponential number of alternatives), a very common solution (the binary method) is frequently used, learning a binary classifier for every category, and combining them all afterwards. The assumption taken in this solution is not realistic, and in this work we give examples where the decisions for all the labels are not taken independently, and thus, a supervised approach should learn those existing relationships among categories to make a better classification. Therefore, we show here a generic methodology that can improve the results obtained by a set of independent probabilistic binary classifiers, by using a combination procedure with a classifier trained on the co-occurrences of the labels. We show an exhaustive experimentation in three different standard corpora of labeled documents (Reuters-21578, Ohsumed-23 and RCV1), which present noticeable improvements in all of them, when using our methodology, in three probabilistic base classifiers.
Learning Gaussian Networks
We describe algorithms for learning Bayesian networks from a combination of user knowledge and statistical data. The algorithms have two components: a scoring metric and a search procedure. The scoring metric takes a network structure, statistical data, and a user's prior knowledge, and returns a score proportional to the posterior probability of the network structure given the data. The search procedure generates networks for evaluation by the scoring metric. Previous work has concentrated on metrics for domains containing only discrete variables, under the assumption that data represents a multinomial sample. In this paper, we extend this work, developing scoring metrics for domains containing all continuous variables or a mixture of discrete and continuous variables, under the assumption that continuous data is sampled from a multivariate normal distribution. Our work extends traditional statistical approaches for identifying vanishing regression coefficients in that we identify two important assumptions, called event equivalence and parameter modularity, that when combined allow the construction of prior distributions for multivariate normal parameters from a single prior Bayesian network specified by a user.
Induction of Selective Bayesian Classifiers
In this paper, we examine previous work on the naive Bayesian classifier and review its limitations, which include a sensitivity to correlated features. We respond to this problem by embedding the naive Bayesian induction scheme within an algorithm that c arries out a greedy search through the space of features. We hypothesize that this approach will improve asymptotic accuracy in domains that involve correlated features without reducing the rate of learning in ones that do not. We report experimental results on six natural domains, including comparisons with decision-tree induction, that support these hypotheses. In closing, we discuss other approaches to extending naive Bayesian classifiers and outline some directions for future research.
Three Approaches to Probability Model Selection
Poland, William B., Shachter, Ross D.
This paper compares three approaches to the problem of selecting among probability models to fit data (1) use of statistical criteria such as Akaike's information criterion and Schwarz's "Bayesian information criterion," (2) maximization of the posterior probability of the model, and (3) maximization of an effectiveness ratio? trading off accuracy and computational cost. The unifying characteristic of the approaches is that all can be viewed as maximizing a penalized likelihood function. The second approach with suitable prior distributions has been shown to reduce to the first. This paper shows that the third approach reduces to the second for a particular form of the effectiveness ratio, and illustrates all three approaches with the problem of selecting the number of components in a mixture of Gaussian distributions. Unlike the first two approaches, the third can be used even when the candidate models are chosen for computational efficiency, without regard to physical interpretation, so that the likelihood and the prior distribution over models cannot be interpreted literally. As the most general and computationally oriented of the approaches, it is especially useful for artificial intelligence applications.
Convex vs nonconvex approaches for sparse estimation: GLasso, Multiple Kernel Learning and Hyperparameter GLasso
Aravkin, Aleksandr Y., Burke, James V., Chiuso, Alessandro, Pillonetto, Gianluigi
The popular Lasso approach for sparse estimation can be derived via marginalization of a joint density associated with a particular stochastic model. A different marginalization of the same probabilistic model leads to a different non-convex estimator where hyperparameters are optimized. Extending these arguments to problems where groups of variables have to be estimated, we study a computational scheme for sparse estimation that differs from the Group Lasso. Although the underlying optimization problem defining this estimator is non-convex, an initialization strategy based on a univariate Bayesian forward selection scheme is presented. This also allows us to define an effective non-convex estimator where only one scalar variable is involved in the optimization process. Theoretical arguments, independent of the correctness of the priors entering the sparse model, are included to clarify the advantages of this non-convex technique in comparison with other convex estimators. Numerical experiments are also used to compare the performance of these approaches.
Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint
Halkias, Xanadu, Paris, Sebastien, Glotin, Herve
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.