 Learning Graphical Models


Indirect Gaussian Graph Learning beyond Gaussianity

arXiv.org Machine Learning

This paper studies how to capture dependency graph structures from real data that may not be multivariate Gaussian. Starting from marginal loss functions not necessarily derived from probability distributions, we use an additive over-parametrization with shrinkage to incorporate variable dependencies into the criterion. We propose an iterative Gaussian graph learning algorithm that is easy to implement. Statistical analysis shows that, with the error measured in terms of a proper Bregman divergence, the estimators have a fast rate of convergence. Real-life examples in different settings are given to demonstrate the efficacy of the proposed methodology.


[Discussion] How Gaussian naïve Bayes forms a non-linear decision boundary? • /r/MachineLearning

@machinelearnbot

How does Gaussian naïve Bayes form a non-linear decision boundary? Also, please explain the decision boundary for decision trees. If the two Gaussians are non-isotropic, you can derive that the boundary is a quadratic/elliptic curve.
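To make the claim about the boundary concrete, here is a minimal sketch, assuming two classes and per-feature Gaussian likelihoods. It expands the naïve Bayes log-odds into a polynomial in the features. The function name `gnb_log_odds_coeffs` and the example numbers are hypothetical, chosen only for illustration. The coefficient on each squared feature is 0.5 * (1/var0 - 1/var1), so the boundary is linear when a feature has the same variance in both classes and quadratic otherwise.

```python
import numpy as np

def gnb_log_odds_coeffs(mu0, var0, mu1, var1, prior0=0.5, prior1=0.5):
    """Coefficients (a, b, c) of the two-class Gaussian naive Bayes log-odds:

        log p(y=1 | x) - log p(y=0 | x) = sum_j a_j x_j^2 + sum_j b_j x_j + c

    mu0/var0 and mu1/var1 are per-feature means and variances of each class.
    """
    mu0, var0, mu1, var1 = (np.asarray(v, dtype=float) for v in (mu0, var0, mu1, var1))
    a = 0.5 * (1.0 / var0 - 1.0 / var1)      # quadratic terms
    b = mu1 / var1 - mu0 / var0              # linear terms
    c = np.sum(0.5 * np.log(var0 / var1)
               + 0.5 * (mu0 ** 2 / var0 - mu1 ** 2 / var1))
    c += np.log(prior1 / prior0)             # prior odds shift the offset only
    return a, b, c

# Same variances in both classes: every quadratic coefficient vanishes (linear boundary).
a_lin, _, _ = gnb_log_odds_coeffs([0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [1.0, 1.0])

# Second feature has class-dependent variance: a quadratic term appears.
a_quad, _, _ = gnb_log_odds_coeffs([0.0, 0.0], [1.0, 4.0], [2.0, 2.0], [1.0, 1.0])
```

The decision boundary is the set where the log-odds equals zero, so any non-zero entry of `a` bends it into a conic section such as an ellipse, parabola, or hyperbola.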


Fast $\epsilon$-free Inference of Simulation Models with Bayesian Conditional Density Estimation

arXiv.org Machine Learning

Many statistical models can be simulated forwards but have intractable likelihoods. Approximate Bayesian Computation (ABC) methods are used to infer properties of these models from data. Traditionally these methods approximate the posterior over parameters by conditioning on data being inside an $\epsilon$-ball around the observed data, which is only correct in the limit $\epsilon\!\rightarrow\!0$. Monte Carlo methods can then draw samples from the approximate posterior to approximate predictions or error bars on parameters. These algorithms critically slow down as $\epsilon\!\rightarrow\!0$, and in practice draw samples from a broader distribution than the posterior. We propose a new approach to likelihood-free inference based on Bayesian conditional density estimation. Preliminary inferences based on limited simulation data are used to guide later simulations. In some cases, learning an accurate parametric representation of the entire true posterior distribution requires fewer model simulations than Monte Carlo ABC methods need to produce a single sample from an approximate posterior.
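The traditional $\epsilon$-ball scheme that the abstract contrasts itself with can be sketched as plain rejection ABC. This is a toy illustration, not the paper's method: the simulator (a Gaussian with unknown mean), the uniform prior on [-5, 5], the sample-mean summary statistic, and the function name `rejection_abc` are all assumptions made for the example.

```python
import numpy as np

def rejection_abc(s_obs, eps, n_sims=20000, n_obs=20, seed=0):
    """Rejection ABC for the mean of a unit-variance Gaussian (toy model).

    Draw parameters from the prior, simulate data, and keep the parameters
    whose simulated summary lands inside the eps-ball around s_obs.
    """
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-5.0, 5.0, size=n_sims)             # prior draws
    data = rng.normal(loc=theta[:, None], scale=1.0,
                      size=(n_sims, n_obs))                 # forward simulations
    s_sim = data.mean(axis=1)                               # summary statistic
    return theta[np.abs(s_sim - s_obs) < eps]               # eps-ball acceptance

wide = rejection_abc(s_obs=1.0, eps=1.0)
narrow = rejection_abc(s_obs=1.0, eps=0.1)
print(len(wide), len(narrow))   # far fewer samples survive the smaller eps
```

Shrinking `eps` brings the accepted samples closer to the true posterior but discards most simulations, which is the slowdown the abstract describes.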


A Bayesian Ensemble for Unsupervised Anomaly Detection

arXiv.org Machine Learning

Methods for unsupervised anomaly detection suffer from the fact that the data is unlabeled, making it difficult to assess the optimality of detection algorithms. Ensemble learning has shown exceptional results in classification and clustering problems, but has not seen as much research in the context of outlier detection. Existing methods focus on combining output scores of individual detectors, but this leads to outputs that are not easily interpretable. In this paper, we introduce a theoretical foundation for combining individual detectors with Bayesian classifier combination. Not only are posterior distributions easily interpreted as the probability distribution of anomalies, but bias, variance, and individual error rates of detectors are all easily obtained. Performance on real-world datasets shows high accuracy across varied types of time series data.


Characteristic Kernels and Infinitely Divisible Distributions

arXiv.org Machine Learning

We connect shift-invariant characteristic kernels to infinitely divisible distributions on $\mathbb{R}^{d}$. Characteristic kernels play an important role in machine learning applications because their kernel means can distinguish any two probability measures. The contribution of this paper is two-fold. First, we show, using the Lévy-Khintchine formula, that any shift-invariant kernel given by a bounded, continuous and symmetric probability density function (pdf) of an infinitely divisible distribution on $\mathbb{R}^d$ is characteristic. We also present some closure properties of such characteristic kernels under addition, pointwise product, and convolution. Second, in developing various kernel mean algorithms, it is fundamental to compute the following values: (i) kernel mean values $m_P(x)$, $x \in \mathcal{X}$, and (ii) kernel mean RKHS inner products $\left\langle m_P, m_Q \right\rangle_{\mathcal{H}}$, for probability measures $P, Q$. If $P$, $Q$, and the kernel $k$ are Gaussian, then computations (i) and (ii) result in Gaussian pdfs, which are tractable. We generalize this Gaussian combination to more general cases in the class of infinitely divisible distributions. We then introduce a conjugate kernel and a convolution trick, so that the above (i) and (ii) have the same pdf form, with the expectation of tractable computation at least in some cases. As specific instances, we explore $\alpha$-stable distributions and a rich class of generalized hyperbolic distributions, which include the Laplace, Cauchy and Student-t distributions.
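As a worked instance of the Gaussian case mentioned above, assume (for illustration only; this parametrization is not taken from the paper) a kernel given by a Gaussian density, $k(x, y) = \mathcal{N}(x - y;\, 0, \sigma^2 I)$, and measures $P = \mathcal{N}(\mu_P, \Sigma_P)$, $Q = \mathcal{N}(\mu_Q, \Sigma_Q)$. Both quantities are then Gaussian convolutions and stay in Gaussian pdf form:

```latex
\begin{align}
m_P(x) &= \int k(x, y)\, dP(y)
        = \mathcal{N}\!\left(x;\ \mu_P,\ \Sigma_P + \sigma^2 I\right), \\
\left\langle m_P, m_Q \right\rangle_{\mathcal{H}}
       &= \iint k(x, y)\, dP(x)\, dQ(y)
        = \mathcal{N}\!\left(\mu_P;\ \mu_Q,\ \Sigma_P + \Sigma_Q + \sigma^2 I\right).
\end{align}
```

The second line follows because $X - Y + \varepsilon$, with independent $X \sim P$, $Y \sim Q$, and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, is Gaussian with mean $\mu_P - \mu_Q$ and covariance $\Sigma_P + \Sigma_Q + \sigma^2 I$, and the inner product is its density evaluated at zero.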


Markov Chain methods for the bipartite Boolean quadratic programming problem

arXiv.org Artificial Intelligence

We study the Bipartite Boolean Quadratic Programming Problem (BBQP), which is an extension of the well-known Boolean Quadratic Programming Problem (BQP). Applications of the BBQP include mining discrete patterns from binary data, approximating matrices by rank-one binary matrices, computing the cut-norm of a matrix, and solving optimisation problems such as maximum weight biclique, bipartite maximum weight cut, maximum weight induced subgraph of a bipartite graph, etc. For the BBQP, we first present several algorithmic components, specifically, hill climbers and mutations, and then show how to combine them in a high-performance metaheuristic. Instead of hand-tuning a standard metaheuristic to test the efficiency of the hybrid of the components, we chose to use an automated generation of a multi-component metaheuristic to save human time, and also to improve objectivity in the analysis and comparisons of components. For this we designed a new metaheuristic schema which we call Conditional Markov Chain Search (CMCS). We show that CMCS is flexible enough to model several standard metaheuristics; this flexibility is controlled by multiple numeric parameters, and so is convenient for automated generation. We study the configurations revealed by our approach and show that the best of them outperforms the previous state-of-the-art BBQP algorithm by several orders of magnitude. In our experiments we use benchmark instances introduced in the preliminary version of this paper and described here, which have already become the de facto standard in the BBQP literature.

Keywords: artificial intelligence, bipartite Boolean quadratic programming, automated heuristic configuration, benchmark

1. Introduction

The (Unconstrained) Boolean Quadratic Programming Problem (BQP) is to maximise f(x) [...] The BQP is a well-studied problem in the operational research literature [6]. The focus of this paper is on a problem closely related to BQP, called the Bipartite (Unconstrained) Boolean Quadratic Programming Problem (BBQP) [23]. A graph theoretic interpretation of the BBQP can be given as follows [23]. Consider a bipartite graph G = (I, J, E). [...] M otherwise, where M is a large positive constant. Then BBQP(Q, c, d) solves the MWBP [23].
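To give a feel for what a BBQP hill climber can look like, here is a minimal sketch. It assumes the commonly used objective f(x, y) = x^T Q y + c x + d y over binary vectors x and y, which the excerpt above names only as BBQP(Q, c, d) without showing the formula. The alternating scheme and the names `bbqp_value` and `alternating_hill_climber` are illustrative assumptions, not the components described in the paper. The key observation is that once x is fixed the objective is linear in y (and vice versa), so the best response is found coordinate by coordinate.

```python
import numpy as np

def bbqp_value(Q, c, d, x, y):
    """Assumed BBQP objective: x^T Q y + c.x + d.y for binary x, y."""
    return float(x @ Q @ y + c @ x + d @ y)

def alternating_hill_climber(Q, c, d, x0, max_iters=100):
    """Alternate exact best responses for y given x and for x given y."""
    x = x0.copy()
    y = ((x @ Q + d) > 0).astype(int)              # best y for the starting x
    best = bbqp_value(Q, c, d, x, y)
    for _ in range(max_iters):
        x_new = ((Q @ y + c) > 0).astype(int)      # best x for the current y
        y_new = ((x_new @ Q + d) > 0).astype(int)  # best y for the new x
        val = bbqp_value(Q, c, d, x_new, y_new)
        if val <= best:                            # no strict improvement: stop
            break
        x, y, best = x_new, y_new, val
    return x, y, best

rng = np.random.default_rng(1)
Q = rng.integers(-5, 6, size=(4, 3))
c = rng.integers(-5, 6, size=4)
d = rng.integers(-5, 6, size=3)
x, y, val = alternating_hill_climber(Q, c, d, np.zeros(4, dtype=int))
```

Each step can only keep or raise the objective, so the loop ends at a solution that neither side can improve alone; it carries no guarantee of global optimality, which is why such climbers are usually paired with mutations.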


Focused Model-Learning and Planning for Non-Gaussian Continuous State-Action Systems

arXiv.org Machine Learning

We introduce a framework for model learning and planning in stochastic domains with continuous state and action spaces and non-Gaussian transition models. It is efficient because (1) local models are estimated only when the planner requires them; (2) the planner focuses on the states most relevant to the current planning problem; and (3) the planner focuses on the most informative and/or high-value actions. Our theoretical analysis shows the validity and asymptotic optimality of the proposed approach. Empirically, we demonstrate the effectiveness of our algorithm on a simulated multi-modal pushing problem.


Simpler PAC-Bayesian Bounds for Hostile Data

arXiv.org Machine Learning

Learning theory can be traced back to the late 60s and has attracted great attention since. We refer to the monographs Devroye et al. (1996) and Vapnik (2000) for a survey. Most of the literature addresses the simplified case of i.i.d. observations coupled with bounded loss functions. Many bounds on the excess risk holding with large probability were provided; these bounds have been referred to as PAC learning bounds since Valiant (1984). In the late 90s, the PAC-Bayesian approach was pioneered by Shawe-Taylor and Williamson (1997) and McAllester (1998, 1999). It consists of producing PAC bounds for a specific class of Bayesian-flavored estimators. Similarly to classical PAC results, most PAC-Bayesian bounds have been obtained with bounded loss functions (see Catoni, 2007, for some of the most accurate results). Note that Catoni (2004) provides bounds for unbounded losses, but still under very strong exponential moment assumptions. These assumptions were essentially not improved in the most recent works, Guedj and Alquier (2013) and Bégin et al. (2016).


Boltzmann-Machine Learning of Prior Distributions of Binarized Natural Images

arXiv.org Machine Learning

Prior distributions of binarized natural images are learned by using a Boltzmann machine. According to the results of this study, a structure with two sublattices emerges in the interactions, and the nearest-neighbor and next-nearest-neighbor interactions correspondingly take two discriminative values, which reflects the individual characteristics of the three sets of pictures that we process. Meanwhile, on a longer spatial scale, a longer-range, although still rapidly decaying, ferromagnetic interaction commonly appears in all cases. The characteristic length scale of the interactions is universally up to approximately four lattice spacings, $\xi \approx 4$. These results are derived by using the mean-field method, which effectively reduces the computational time required in a Boltzmann machine. An improved mean-field method called the Bethe approximation also gives the same results, as does the Monte Carlo method for small images. These agreements reinforce the validity of our analysis and findings. Relations to criticality, frustration, and simple-cell receptive fields are also discussed.


Formulas for Counting the Sizes of Markov Equivalence Classes of Directed Acyclic Graphs

arXiv.org Machine Learning

The sizes of Markov equivalence classes of directed acyclic graphs play important roles in measuring the uncertainty and complexity in causal learning. A Markov equivalence class can be represented by an essential graph, and its undirected subgraphs determine the size of the class. In this paper, we develop a method to derive the formulas for counting the sizes of Markov equivalence classes. We first introduce a new concept of core graph. The size of a Markov equivalence class of interest is a polynomial in the number of vertices given its core graph. Then, we discuss the recursive and explicit formulas of the polynomial, and provide an algorithm to derive the size formula via symbolic computation for any given core graph. The proposed size formula derivation sheds light on the relationships between the size of a Markov equivalence class and its representation graph, and makes size counting efficient, even when the essential graphs contain non-sparse undirected subgraphs.
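For small graphs, the quantity being counted can be checked by brute force. The sketch below is not the paper's formula-based method. It relies on the standard fact that the DAGs represented by an undirected chordal component are exactly its acyclic orientations that create no v-structure (two non-adjacent parents pointing at a common child). The function names are hypothetical, and the enumeration is exponential in the number of edges, so it is only a sanity check.

```python
from itertools import product

def is_acyclic(n, arcs):
    """Kahn's algorithm: True if the directed graph on n vertices has no cycle."""
    indeg = [0] * n
    out = [[] for _ in range(n)]
    for u, v in arcs:
        out[u].append(v)
        indeg[v] += 1
    stack = [v for v in range(n) if indeg[v] == 0]
    seen = 0
    while stack:
        u = stack.pop()
        seen += 1
        for v in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return seen == n

def has_v_structure(n, arcs, adjacent):
    """True if some vertex has two parents that are not adjacent to each other."""
    parents = [[] for _ in range(n)]
    for u, v in arcs:
        parents[v].append(u)
    for ps in parents:
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                if frozenset((ps[i], ps[j])) not in adjacent:
                    return True
    return False

def mec_size(n, edges):
    """Count acyclic, v-structure-free orientations of an undirected graph."""
    adjacent = {frozenset(e) for e in edges}
    count = 0
    for flips in product([False, True], repeat=len(edges)):
        arcs = [(v, u) if f else (u, v) for (u, v), f in zip(edges, flips)]
        if is_acyclic(n, arcs) and not has_v_structure(n, arcs, adjacent):
            count += 1
    return count

path3 = mec_size(3, [(0, 1), (1, 2)])             # undirected path on 3 vertices
triangle = mec_size(3, [(0, 1), (1, 2), (0, 2)])  # complete graph on 3 vertices
star4 = mec_size(4, [(0, 1), (0, 2), (0, 3)])     # star with three leaves
print(path3, triangle, star4)  # → 3 6 4
```

The tree cases illustrate the polynomial dependence on the number of vertices mentioned above: an undirected tree on p vertices has exactly p valid orientations, one per choice of root.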