Goto

Collaborating Authors

 Directed Networks


Learning Directed Acyclic Graphs with Penalized Neighbourhood Regression

arXiv.org Machine Learning

We study a family of regularized score-based estimators for learning the structure of a directed acyclic graph (DAG) for a multivariate normal distribution from high-dimensional data with $p\gg n$. Our main results establish support recovery guarantees and deviation bounds for a family of penalized least-squares estimators under concave regularization without assuming prior knowledge of a variable ordering. These results apply to a variety of practical situations that allow for arbitrary nondegenerate covariance structures as well as many popular regularizers including the MCP, SCAD, $\ell_{0}$ and $\ell_{1}$. The proof relies on interpreting a DAG as a recursive linear structural equation model, which reduces the estimation problem to a series of neighbourhood regressions. We provide a novel statistical analysis of these neighbourhood problems, establishing uniform control over the superexponential family of neighbourhoods associated with a Gaussian distribution. We then apply these results to study the statistical properties of score-based DAG estimators, learning causal DAGs, and inferring conditional independence relations via graphical models. Our results yield---for the first time---finite-sample guarantees for structure learning of Gaussian DAGs in high-dimensions via score-based estimation.


Upper Bound of Bayesian Generalization Error in Non-negative Matrix Factorization

arXiv.org Machine Learning

Recently, nonnegative matrix factorization (NMF) [1, 2] has been applied to text mining [3], signal processing [4, 5, 6], bioinformatics [7], and consumer analysis [8]. Experiments has shown that a new knowledge discovery method is derived by NMF, however, its mathematical property as a learning machine is not yet clarified, since it is not a regular statistical model. A statistical model is called regular if a function from a parameter to a probability density function is one-to-one and if the likelihood function can be approximated by a Gaussian function. It is proved that, if a statistical model is regular and if a true distribution is realizable by a statistical model, then the generalization error is asymptotically equal to d/(2n), where d, n, and the generalization error are the dimension of the parameter, the sample size, and the expected Kullback-Leibler divergence of the true distribution and the estimated learning machine, respectively. However, the statistical model used in NMF is not regular because the map from a parameter to a probability density function is not injective.


Bayesian estimation from few samples: community detection and related problems

arXiv.org Machine Learning

We propose an efficient meta-algorithm for Bayesian estimation problems that is based on low-degree polynomials, semidefinite programming, and tensor decomposition. The algorithm is inspired by recent lower bound constructions for sum-of-squares and related to the method of moments. Our focus is on sample complexity bounds that are as tight as possible (up to additive lower-order terms) and often achieve statistical thresholds or conjectured computational thresholds. Our algorithm recovers the best known bounds for community detection in the sparse stochastic block model, a widely-studied class of estimation problems for community detection in graphs. We obtain the first recovery guarantees for the mixed-membership stochastic block model (Airoldi et el.) in constant average degree graphs---up to what we conjecture to be the computational threshold for this model. We show that our algorithm exhibits a sharp computational threshold for the stochastic block model with multiple communities beyond the Kesten--Stigum bound---giving evidence that this task may require exponential time. The basic strategy of our algorithm is strikingly simple: we compute the best-possible low-degree approximation for the moments of the posterior distribution of the parameters and use a robust tensor decomposition algorithm to recover the parameters from these approximate posterior moments.


Bayesian Learning for Statistical Classification โ€“ Stats and Bots

#artificialintelligence

A well-calibrated estimator for the conditional probabilities should obey this equation. Once we have derived a statistical classifier, we need to validate it on some test data. This data should be different from that used to train the classifier, otherwise skill scores will be unduly optimistic. This is known as cross-validation. The confusion matrix expresses everything about the accuracy of a discrete classifier over a given database and you can use it to compose any possible skill score. Here, we are going to cover two that are rarely seen in the literature, but are nonetheless important for reasons that will become clear.


Distance-based Confidence Score for Neural Network Classifiers

arXiv.org Machine Learning

The reliable measurement of confidence in classifiers' predictions is very important for many applications and is, therefore, an important part of classifier design. Yet, although deep learning has received tremendous attention in recent years, not much progress has been made in quantifying the prediction confidence of neural network classifiers. Bayesian models offer a mathematically grounded framework to reason about model uncertainty, but usually come with prohibitive computational costs. In this paper we propose a simple, scalable method to achieve a reliable confidence score, based on the data embedding derived from the penultimate layer of the network. We investigate two ways to achieve desirable embeddings, by using either a distance-based loss or Adversarial Training. We then test the benefits of our method when used for classification error prediction, weighting an ensemble of classifiers, and novelty detection. In all tasks we show significant improvement over traditional, commonly used confidence scores.


The Mathematics of Machine Learning

#artificialintelligence

Finally, the main aim of this blog post is to give a well-intentioned advice about the importance of Mathematics in Machine Learning and the necessary topics and useful resources for a mastery of these topics. However, some Machine Learning enthusiasts are novice in Maths and will probably find this post disheartening (seriously, this is not my aim). For beginners, you don't need a lot of Mathematics to start doing Machine Learning. The fundamental prerequisite is data analysis as described in this blog post and you can learn the maths on the go as you master more techniques and algorithms. This entry was originally published on my LinkedIn page.


On the Semantics and Complexity of Probabilistic Logic Programs

Journal of Artificial Intelligence Research

We examine the meaning and the complexity of probabilistic logic programs that consist of a set of rules and a set of independent probabilistic facts (that is, programs based on Sato's distribution semantics). We focus on two semantics, respectively based on stable and on well-founded models. We show that the semantics based on stable models (referred to as the "credal semantics") produces sets of probability measures that dominate infinitely monotone Choquet capacities; we describe several useful consequences of this result. We then examine the complexity of inference with probabilistic logic programs. We distinguish between the complexity of inference when a probabilistic program and a query are given (the inferential complexity), and the complexity of inference when the probabilistic program is fixed and the query is given (the query complexity, akin to data complexity as used in database theory). We obtain results on the inferential and query complexity for acyclic, stratified, and normal propositional and relational programs; complexity reaches various levels of the counting hierarchy and even exponential levels.


The detour problem in a stochastic environment: Tolman revisited

arXiv.org Machine Learning

We designed a grid world task to study human planning and re-planning behavior in an unknown stochastic environment. In our grid world, participants were asked to travel from a random starting point to a random goal position while maximizing their reward. Because they were not familiar with the environment, they needed to learn its characteristics from experience to plan optimally. Later in the task, we randomly blocked the optimal path to investigate whether and how people adjust their original plans to find a detour. To this end, we developed and compared 12 different models. These models were different on how they learned and represented the environment and how they planned to catch the goal. The majority of our participants were able to plan optimally. We also showed that people were capable of revising their plans when an unexpected event occurred. The result from the model comparison showed that the model-based reinforcement learning approach provided the best account for the data and outperformed heuristics in explaining the behavioral data in the re-planning trials.


Multi-way Interacting Regression via Factorization Machines

arXiv.org Machine Learning

We propose a Bayesian regression method that accounts for multi-way interactions of arbitrary orders among the predictor variables. Our model makes use of a factorization mechanism for representing the regression coefficients of interactions among the predictors, while the interaction selection is guided by a prior distribution on random hypergraphs, a construction which generalizes the Finite Feature Model. We present a posterior inference algorithm based on Gibbs sampling, and establish posterior consistency of our regression model. Our method is evaluated with extensive experiments on simulated data and demonstrated to be able to identify meaningful interactions in applications in genetics and retail demand forecasting.


Learning Multi-grid Generative ConvNets by Minimal Contrastive Divergence

arXiv.org Machine Learning

This paper proposes a minimal contrastive divergence method for learning energy-based generative ConvNet models of images at multiple grids (or scales) simultaneously. For each grid, we learn an energy-based probabilistic model where the energy function is defined by a bottom-up convolutional neural network (ConvNet or CNN). Learning such a model requires generating synthesized examples from the model. Within each iteration of our learning algorithm, for each observed training image, we generate synthesized images at multiple grids by initializing the finite-step MCMC sampling from a minimal 1 x 1 version of the training image. The synthesized image at each subsequent grid is obtained by a finite-step MCMC initialized from the synthesized image generated at the previous coarser grid. After obtaining the synthesized examples, the parameters of the models at multiple grids are updated separately and simultaneously based on the differences between synthesized and observed examples. We call this learning method the multi-grid minimal contrastive divergence. We show that this method can learn realistic energy-based generative ConvNet models, and it outperforms the original contrastive divergence (CD) and persistent CD.