Maximum likelihood estimation for Machine Learning - Nucleusbox


In the Logistic Regression for Machine Learning using Python blog, I introduced the basic idea of the logistic function. We discussed the cost function and, for the iterative method, focused on gradient descent optimization. In this section, we introduce the maximum likelihood cost function, which we would like to maximize.
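As an illustration of what maximizing a likelihood looks like for logistic regression, the following minimal sketch computes the Bernoulli log-likelihood and increases it by gradient ascent. The function names, learning rate, and step count are assumptions made for this example and are not taken from the blog post.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(weights, bias, features, labels):
    """Bernoulli log-likelihood of binary labels under a logistic model."""
    total = 0.0
    for x, y in zip(features, labels):
        p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
        # Clamp the probability so log() never receives exactly 0.
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_by_gradient_ascent(features, labels, learning_rate=0.1, steps=500):
    """Maximize the mean log-likelihood with plain gradient ascent."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = y - p  # derivative of the log-likelihood w.r.t. the logit
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w + learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias += learning_rate * grad_b / n
    return weights, bias
```

Maximizing this log-likelihood is equivalent to minimizing the cross-entropy cost, so gradient ascent here mirrors gradient descent on the negated objective.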

Variational Bayes In Private Settings (VIPS)

Journal of Artificial Intelligence Research

Many applications of Bayesian data analysis involve sensitive information such as personal documents or medical records, motivating methods which ensure that privacy is protected. We introduce a general privacy-preserving framework for Variational Bayes (VB), a widely used optimization-based Bayesian inference method. Our framework respects differential privacy, the gold-standard privacy criterion, and encompasses a large class of probabilistic models, called the Conjugate Exponential (CE) family. We observe that we can straightforwardly privatise VB's approximate posterior distributions for models in the CE family, by perturbing the expected sufficient statistics of the complete-data likelihood. For a broadly-used class of non-CE models, those with binomial likelihoods, we show how to bring such models into the CE family, such that inferences in the modified model resemble the private variational Bayes algorithm as closely as possible, using the Pólya-Gamma data augmentation scheme. The iterative nature of variational Bayes presents a further challenge since iterations increase the amount of noise needed. We overcome this by combining: (1) an improved composition method for differential privacy, called the moments accountant, which provides a tight bound on the privacy cost of multiple VB iterations and thus significantly decreases the amount of additive noise; and (2) the privacy amplification effect of subsampling mini-batches from large-scale data in stochastic learning. We empirically demonstrate the effectiveness of our method in CE and non-CE models including latent Dirichlet allocation, Bayesian logistic regression, and sigmoid belief networks, evaluated on real-world datasets.
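The idea of privatising a posterior by perturbing sufficient statistics can be illustrated on a much simpler model than those in the paper. The sketch below is not the VIPS algorithm: it assumes a Beta-Bernoulli model, a single release rather than iterative VB, and the Laplace mechanism rather than the moments accountant. All names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with rate 1/scale
    # is Laplace-distributed with location 0 and the given scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_beta_posterior(data, epsilon, prior_a=1.0, prior_b=1.0, rng=None):
    """Return Beta posterior parameters computed from noisy counts.

    data: iterable of 0/1 observations.
    epsilon: privacy budget for this single release (assumed > 0).
    """
    rng = rng or random.Random()
    data = list(data)
    successes = sum(1 for x in data if x == 1)
    failures = len(data) - successes
    # Replacing one record changes (successes, failures) by at most 2 in
    # L1 norm, so Laplace noise with scale 2/epsilon is added to each count.
    scale = 2.0 / epsilon
    noisy_successes = max(0.0, successes + laplace_noise(scale, rng))
    noisy_failures = max(0.0, failures + laplace_noise(scale, rng))
    # Clamping at zero is post-processing and does not affect the guarantee.
    return prior_a + noisy_successes, prior_b + noisy_failures
```

Because only the sufficient statistics touch the data, the conjugate update itself needs no further modification once the counts have been perturbed.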

Machine learning for causal inference: on the use of cross-fit estimators Machine Learning

Modern causal inference methods allow machine learning to be used to weaken parametric modeling assumptions. However, the use of machine learning may result in bias and incorrect inferences due to overfitting. Cross-fit estimators have been proposed to eliminate this bias and yield better statistical properties. We conducted a simulation study to assess the performance of several different estimators for the average causal effect (ACE). The data generating mechanisms for the simulated treatment and outcome included log-transforms, polynomial terms, and discontinuities. We compared singly-robust estimators (g-computation, inverse probability weighting) and doubly-robust estimators (augmented inverse probability weighting, targeted maximum likelihood estimation). Nuisance functions were estimated with parametric models and ensemble machine learning, separately. We further assessed cross-fit doubly-robust estimators. With correctly specified parametric models, all of the estimators were unbiased and confidence intervals achieved nominal coverage. When used with machine learning, the cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage. Due to the difficulty of properly specifying parametric models in high dimensional data, doubly-robust estimators with ensemble learning and cross-fitting may be the preferred approach for estimation of the ACE in most epidemiologic studies. However, these approaches may require larger sample sizes to avoid finite-sample issues.
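To make the cross-fitting idea concrete, here is a minimal sketch of a cross-fit augmented inverse probability weighting (AIPW) estimator. It assumes a single binary confounder and uses stratum means as a stand-in for the ensemble machine learning used in the study; the function names and the propensity clipping bounds are assumptions for this example.

```python
import random


def stratum_mean(values, default):
    return sum(values) / len(values) if values else default


def fit_nuisances(train):
    """Fit propensity and outcome lookups from (x, a, y) rows with binary x, a."""
    overall_y = stratum_mean([y for _, _, y in train], 0.0)
    propensity, outcome = {}, {}
    for x in (0, 1):
        p = stratum_mean([a for xi, a, _ in train if xi == x], 0.5)
        # Clip to keep the inverse weights bounded (assumed bounds).
        propensity[x] = min(max(p, 0.01), 0.99)
        for a in (0, 1):
            ys = [y for xi, ai, y in train if xi == x and ai == a]
            outcome[(x, a)] = stratum_mean(ys, overall_y)
    return propensity, outcome


def cross_fit_aipw(data, n_folds=2, seed=0):
    """Estimate the average causal effect with nuisances fit out-of-fold."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    scores = []
    for k, held_out in enumerate(folds):
        # Nuisance models never see the rows they are evaluated on.
        train = [data[i] for j, fold in enumerate(folds) if j != k for i in fold]
        propensity, outcome = fit_nuisances(train)
        for i in held_out:
            x, a, y = data[i]
            e = propensity[x]
            m1, m0 = outcome[(x, 1)], outcome[(x, 0)]
            scores.append(
                m1 - m0 + a * (y - m1) / e - (1 - a) * (y - m0) / (1.0 - e)
            )
    return sum(scores) / len(scores)
```

Fitting the nuisance functions on one part of the sample and evaluating the estimator on the other is what separates the cross-fit version from a plain AIPW estimator.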

High-dimensional macroeconomic forecasting using message passing algorithms Machine Learning

As a response to the increasing linkages between the macroeconomy and the financial sector, as well as the expanding interconnectedness of the global economy, empirical macroeconomic models have increased both in complexity and size. For that reason, estimation of modern models that inform macroeconomic decisions - such as linear and nonlinear versions of dynamic stochastic general equilibrium (DSGE) and vector autoregressive (VAR) models - often relies on Bayesian inference via powerful Markov chain Monte Carlo (MCMC) methods. However, existing posterior simulation algorithms cannot scale up to very high dimensions due to the computational inefficiency and the larger numerical error associated with repeated sampling via Monte Carlo; see Angelino et al. (2016) for a thorough review of such computational issues from a machine learning and high-dimensional data perspective. In that respect, while Bayesian inference is a natural probabilistic framework for learning about parameters by utilizing all information in the data likelihood and prior, computational restrictions might make it less suitable for supporting real-time decision-making in very high dimensions. This paper introduces to the econometric literature the framework of factor graphs (Kschischang et al., 2001) for the purpose of designing computationally efficient, and easy to maintain, Bayesian estimation algorithms. The focus is not only on "faster" posterior inference broadly interpreted, but on designing algorithms that have such low complexity that they are future-proof and can be used in high-dimensional econometric problems with possibly thousands or millions of coefficients.

A Gamma-Poisson Mixture Topic Model for Short Text Machine Learning

Most topic models are constructed under the assumption that documents follow a multinomial distribution. The Poisson distribution is an alternative distribution to describe the probability of count data. For topic modelling, the Poisson distribution describes the number of occurrences of a word in documents of fixed length. The Poisson distribution has been successfully applied in text classification, but its application to topic modelling is not well documented, specifically in the context of a generative probabilistic model. Furthermore, the few Poisson topic models in the literature are admixture models, making the assumption that a document is generated from a mixture of topics. In this study, we focus on short text. Many studies have shown that the simpler assumption of a mixture model fits short text better. With mixture models, as opposed to admixture models, the generative assumption is that a document is generated from a single topic. One topic model that makes this one-topic-per-document assumption is the Dirichlet-multinomial mixture model. The main contributions of this work are a new Gamma-Poisson mixture model, as well as a collapsed Gibbs sampler for the model. The benefit of the collapsed Gibbs sampler derivation is that the model is able to automatically select the number of topics contained in the corpus. The results show that the Gamma-Poisson mixture model performs better than the Dirichlet-multinomial mixture model at selecting the number of topics in labelled corpora. Furthermore, the Gamma-Poisson mixture produces better topic coherence scores than the Dirichlet-multinomial mixture model, thus making it a viable option for the challenging task of topic modelling of short text.

Bayesian nonparametric modeling for predicting dynamic dependencies in multiple object tracking Machine Learning

Some challenging problems in tracking multiple objects include the time-dependent cardinality, unordered measurements and object parameter labeling. In this paper, we employ Bayesian nonparametric methods to address these challenges. In particular, we propose modeling the multiple object parameter state prior using the dependent Dirichlet and Pitman-Yor processes. These nonparametric models have been shown to be more flexible and robust, when compared to existing methods, for estimating the time-varying number of objects, providing object labeling and identifying measurement to object associations. Monte Carlo sampling methods are then proposed to efficiently learn the trajectory of objects from noisy measurements. Using simulations, we demonstrate the estimation performance advantage of the new methods when compared to existing algorithms such as the generalized labeled multi-Bernoulli filter.

The R Package stagedtrees for Structural Learning of Stratified Staged Trees Machine Learning

In the past twenty years there has been an explosion in the use of graphical models to represent the relationship between a vector of random variables and perform distributed inference which takes advantage of the underlying graphical representations. Bayesian networks (BNs) (Darwiche 2009; Fenton and Neil 2012) are nowadays the most used graphical models, with applications to a wide array of domains and implementation in various software: for instance, the R packages bnlearn by Scutari (2010) and gRain by Højsgaard (2012), among others. However, BNs can only represent symmetric conditional independences which in practical applications may not be fully justified. For this reason, a variety of models that can take into account the asymmetric nature of real-world data have been proposed; for example, context-specific BNs (Boutilier, Friedman, Goldszmidt, and Koller 1996), labeled directed acyclic graphs (Pensar, Nyman, Koski, and Corander 2015) and probabilistic decision graphs (Jaeger, Nielsen, and Silander 2006). Unlike most of its competitors, the chain event graph (CEG) (Collazo, Görgen, and Smith 2018; Smith and Anderson 2008; Riccomagno and Smith 2004, 2009) can capture all (context-specific) conditional independences in a unique graph, obtained by a coalescence over the vertices of an appropriately constructed probability tree, called a staged tree.

Learning from Aggregate Observations Machine Learning

We study the problem of learning from aggregate observations where supervision signals are given to sets of instances instead of individual instances, while the goal is still to predict labels of unseen individuals. A well-known example is multiple instance learning (MIL). In this paper, we extend MIL beyond binary classification to other problems such as multiclass classification and regression. We present a probabilistic framework that is applicable to a variety of aggregate observations, e.g., pairwise similarity for classification and mean/difference/rank observation for regression. We propose a simple yet effective method based on the maximum likelihood principle, which can be simply implemented for various differentiable models such as deep neural networks and gradient boosting machines. Experiments on three novel problem settings -- classification via triplet comparison and regression via mean/rank observation -- indicate the effectiveness of the proposed method.

Estimation of Classification Rules from Partially Classified Data Machine Learning

We consider the situation where the observed sample contains some observations whose class of origin is known (that is, they are classified with respect to the g underlying classes of interest), and where the remaining observations in the sample are unclassified (that is, their class labels are unknown). For class-conditional distributions taken to be known up to a vector of unknown parameters, the aim is to estimate the Bayes' rule of allocation for the allocation of subsequent unclassified observations. Estimation on the basis of both the classified and unclassified data can be undertaken in a straightforward manner by fitting a g-component mixture model by maximum likelihood (ML) via the EM algorithm in the situation where the observed data can be assumed to be an observed random sample from the adopted mixture distribution. This assumption applies if the missing-data mechanism is ignorable in the terminology pioneered by Rubin (1976). An initial likelihood approach was to use the so-called classification ML approach whereby the missing labels are taken to be parameters to be estimated along with the parameters of the class-conditional distributions. However, as it can lead to inconsistent estimates, the focus of attention switched to the mixture ML approach after the appearance of the EM algorithm (Dempster et al., 1977). Particular attention is given here to the asymptotic relative efficiency (ARE) of the Bayes' rule estimated from a partially classified sample. Lastly, we consider briefly some recent results in situations where the missing label pattern is non-ignorable for the purposes of ML estimation for the mixture model.
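The mixture ML approach described above can be sketched for the simplest case: a two-component univariate normal mixture fitted by EM, where classified observations keep fixed one-hot responsibilities and unclassified ones receive posterior probabilities. This is an illustrative sketch only; it assumes g = 2, an ignorable missing-label mechanism, and at least one classified observation per class, and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_partially_classified(points, labels, n_iter=100):
    """EM for a 2-component normal mixture; labels are 0, 1, or None."""
    n = len(points)
    # Initialise the means from the classified observations of each class.
    means = []
    for k in (0, 1):
        known = [x for x, z in zip(points, labels) if z == k]
        means.append(sum(known) / len(known))
    grand_mean = sum(points) / n
    grand_var = sum((x - grand_mean) ** 2 for x in points) / n
    variances = [grand_var, grand_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: known labels are fixed; unknown labels get posteriors.
        resp = []
        for x, z in zip(points, labels):
            if z is not None:
                resp.append([1.0 if k == z else 0.0 for k in (0, 1)])
                continue
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in (0, 1)]
            total = sum(dens)
            # Fall back to the mixing proportions if both densities underflow.
            resp.append([d / total for d in dens] if total > 0 else list(weights))
        # M-step: weighted updates of proportions, means, and variances.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, points)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, points)) / nk
            variances[k] = max(var, 1e-6)  # guard against degenerate components
    return weights, means, variances
```

The classification ML approach would instead treat the unknown labels as hard assignments to be estimated, which is the variant the abstract notes can lead to inconsistent estimates.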

Optimal Learning for Sequential Decisions in Laboratory Experimentation Artificial Intelligence

The process of discovery in the physical, biological and medical sciences can be painstakingly slow. Most experiments fail, and the time from initiation of research until a new advance reaches commercial production can span 20 years. This tutorial aims to provide experimental scientists with a foundation in the science of making decisions. Using numerical examples drawn from the experiences of the authors, the article describes the fundamental elements of any experimental learning problem. It emphasizes the important role of belief models, which include not only the best estimate of relationships provided by prior research, previous experiments and scientific expertise, but also the uncertainty in these relationships. We introduce the concept of a learning policy, and review the major categories of policies. We then introduce a policy, known as the knowledge gradient, that maximizes the value of information from each experiment. We bring out the importance of reducing uncertainty, and illustrate this process for different belief models.
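For the simplest belief model, independent normal beliefs about each candidate experiment with known measurement noise, the knowledge gradient has a closed form. The sketch below computes it under those assumptions, which are choices made for this example rather than the tutorial's own setup; it also assumes at least two alternatives and a strictly positive noise variance.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def knowledge_gradient(means, variances, noise_variance):
    """Expected one-step gain in the best posterior mean for each alternative.

    means, variances: current belief about each alternative's true value.
    noise_variance: variance of a single measurement (assumed > 0).
    """
    scores = []
    for i, (mu, var) in enumerate(zip(means, variances)):
        best_other = max(m for j, m in enumerate(means) if j != i)
        # Standard deviation of the change in the posterior mean after one
        # measurement of alternative i.
        sigma_tilde = var / math.sqrt(var + noise_variance)
        if sigma_tilde == 0.0:
            scores.append(0.0)  # nothing left to learn about this alternative
            continue
        zeta = -abs(mu - best_other) / sigma_tilde
        scores.append(sigma_tilde * (zeta * normal_cdf(zeta) + normal_pdf(zeta)))
    return scores
```

The policy then runs the experiment with the largest score, which favours alternatives that are both uncertain and close enough to the current best for a measurement to change the decision.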