Learning Graphical Models
Vertex nomination schemes for membership prediction
Fishkind, D. E., Lyzinski, V., Pao, H., Chen, L., Priebe, C. E.
Suppose that a graph is realized from a stochastic block model where one of the blocks is of interest, but many or all of the vertices' block labels are unobserved. The task is to order the vertices with unobserved block labels into a "nomination list" such that, with high probability, vertices from the interesting block are concentrated near the list's beginning. We propose several vertex nomination schemes. Our basic--but principled--setting and development yields a best nomination scheme (which is a Bayes-Optimal analogue), and also a likelihood maximization nomination scheme that is practical to implement when there are a thousand vertices, and which is empirically near-optimal when the number of vertices is small enough to allow comparison to the best nomination scheme. We then illustrate the robustness of the likelihood maximization nomination scheme to the modeling challenges inherent in real data, using examples which include a social network involving human trafficking, the Enron Graph, a worm brain connectome and a political blog network. In a stochastic block model, the vertices of the graph are partitioned into blocks, and the existence/nonexistence of an edge between any pair of vertices is an independent Bernoulli trial, with the Bernoulli parameter being a function of the block memberships of the pair of vertices. We are concerned here with a graph realized from a stochastic block model such that many or all of the vertices' block labels are hidden (i.e., unobserved). Received August 2014; revised February 2015. Supported in part by Johns Hopkins University Human Language Technology Center of Excellence (JHU HLT COE) and the XDATA program of the Defense Advanced Research Projects Agency (DARPA) administered through Air Force Research Laboratory contract FA8750-12-2-0303.
Accelerating pseudo-marginal Metropolis-Hastings by correlating auxiliary variables
Dahlin, Johan, Lindsten, Fredrik, Kronander, Joel, Schön, Thomas B.
Pseudo-marginal Metropolis-Hastings (pmMH) is a powerful method for Bayesian inference in models where the posterior distribution is analytical intractable or computationally costly to evaluate directly. It operates by introducing additional auxiliary variables into the model and form an extended target distribution, which then can be evaluated point-wise. In many cases, the standard Metropolis-Hastings is then applied to sample from the extended target and the sought posterior can be obtained by marginalisation. However, in some implementations this approach suffers from poor mixing as the auxiliary variables are sampled from an independent proposal. We propose a modification to the pmMH algorithm in which a Crank-Nicolson (CN) proposal is used instead. This results in that we introduce a positive correlation in the auxiliary variables. We investigate how to tune the CN proposal and its impact on the mixing of the resulting pmMH sampler. The conclusion is that the proposed modification can have a beneficial effect on both the mixing of the Markov chain and the computational cost for each iteration of the pmMH algorithm.
Binary Classifier Calibration using an Ensemble of Near Isotonic Regression Models
Naeini, Mahdi Pakdaman, Cooper, Gregory F.
Learning accurate probabilistic models from data is crucial in many practical tasks in data mining. In this paper we present a new non-parametric calibration method called \textit{ensemble of near isotonic regression} (ENIR). The method can be considered as an extension of BBQ, a recently proposed calibration method, as well as the commonly used calibration method based on isotonic regression. ENIR is designed to address the key limitation of isotonic regression which is the monotonicity assumption of the predictions. Similar to BBQ, the method post-processes the output of a binary classifier to obtain calibrated probabilities. Thus it can be combined with many existing classification models. We demonstrate the performance of ENIR on synthetic and real datasets for the commonly used binary classification models. Experimental results show that the method outperforms several common binary classifier calibration methods. In particular on the real data, ENIR commonly performs statistically significantly better than the other methods, and never worse. It is able to improve the calibration power of classifiers, while retaining their discrimination power. The method is also computationally tractable for large scale datasets, as it is $O(N \log N)$ time, where $N$ is the number of samples.
Resolving the Geometric Locus Dilemma for Support Vector Learning Machines
Capacity control, the bias/variance dilemma, and learning unknown functions from data, are all concerned with identifying effective and consistent fits of unknown geometric loci to random data points. A geometric locus is a curve or surface formed by points, all of which possess some uniform property. A geometric locus of an algebraic equation is the set of points whose coordinates are solutions of the equation. Any given curve or surface must pass through each point on a specified locus. This paper argues that it is impossible to fit random data points to algebraic equations of partially configured geometric loci that reference arbitrary Cartesian coordinate systems. It also argues that the fundamental curve of a linear decision boundary is actually a principal eigenaxis. It is shown that learning principal eigenaxes of linear decision boundaries involves finding a point of statistical equilibrium for which eigenenergies of principal eigenaxis components are symmetrically balanced with each other. It is demonstrated that learning linear decision boundaries involves strong duality relationships between a statistical eigenlocus of principal eigenaxis components and its algebraic forms, in primal and dual, correlated Hilbert spaces. Locus equations are introduced and developed that describe principal eigen-coordinate systems for lines, planes, and hyperplanes. These equations are used to introduce and develop primal and dual statistical eigenlocus equations of principal eigenaxes of linear decision boundaries. Important generalizations for linear decision boundaries are shown to be encoded within a dual statistical eigenlocus of principal eigenaxis components. Principal eigenaxes of linear decision boundaries are shown to encode Bayes' likelihood ratio for common covariance data and a robust likelihood ratio for all other data.
Accurate estimation of influenza epidemics using Google search data via ARGO
Yang, Shihao, Santillana, Mauricio, Kou, S. C.
Accurate real-time tracking of influenza outbreaks helps public health officials make timely and meaningful decisions that could save lives. We propose an influenza tracking model, ARGO (AutoRegression with GOogle search data), that uses publicly available online search data. In addition to having a rigorous statistical foundation, ARGO outperforms all previously available Google-searchbased tracking models, including the latest version of Google Flu Trends, even though it uses only low-quality search data as input from publicly available Google Trends and Google Correlate websites. ARGO not only incorporates the seasonality in influenza epidemics but also captures changes in peoples online search behavior over time. ARGO is also flexible, self-correcting, robust, and scalable, making it a potentially powerful tool that can be used for real-time tracking of other social events at multiple temporal and spatial resolutions. There are some minor differences between this preprint and the published paper. Big data sets are constantly generated nowadays as the activities of millions of users are collected from internet-based services. Numerous studies have suggested great potential of these big data sets to detect/manage epidemic outbreaks (influenza [1, 2, 3, 4, 5, 6], Ebola [7], dengue [8]), predict changes in stock prices [9, 10] and housing prices [11], etc. In 2009, Google Flu Trends (GFT), a digital disease detection system that uses the volume of selected Google search terms to estimate current influenza-like illnesses (ILI) activity, was identified by many as a good example of how big data would transform traditional statistical predictive analysis [12].
Neural Adaptive Sequential Monte Carlo
Gu, Shixiang, Ghahramani, Zoubin, Turner, Richard E.
Sequential Monte Carlo (SMC), or particle filtering, is a popular class of methods for sampling from an intractable target distribution using a sequence of simpler intermediate distributions. Like other importance sampling-based methods, performance is critically dependent on the proposal distribution: a bad proposal can lead to arbitrarily inaccurate estimates of the target distribution. This paper presents a new method for automatically adapting the proposal using an approximation of the Kullback-Leibler divergence between the true posterior and the proposal distribution. The method is very flexible, applicable to any parameterized proposal distribution and it supports online and batch variants. We use the new framework to adapt powerful proposal distributions with rich parameterizations based upon neural networks leading to Neural Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC significantly improves inference in a non-linear state space model outperforming adaptive proposal methods including the Extended Kalman and Unscented Particle Filters. Experiments also indicate that improved inference translates into improved parameter learning when NASMC is used as a subroutine of Particle Marginal Metropolis Hastings. Finally we show that NASMC is able to train a latent variable recurrent neural network (LV-RNN) achieving results that compete with the state-of-the-art for polymorphic music modelling. NASMC can be seen as bridging the gap between adaptive SMC methods and the recent work in scalable, black-box variational inference.
How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary?
Modern applications and progress in deep learning research have created renewed interest for generative models of text and of images. However, even today it is unclear what objective functions one should use to train and evaluate these models. In this paper we present two contributions. Firstly, we present a critique of scheduled sampling, a state-of-the-art training method that contributed to the winning entry to the MSCOCO image captioning benchmark in 2015. Here we show that despite this impressive empirical performance, the objective function underlying scheduled sampling is improper and leads to an inconsistent learning algorithm. Secondly, we revisit the problems that scheduled sampling was meant to address, and present an alternative interpretation. We argue that maximum likelihood is an inappropriate training objective when the end-goal is to generate natural-looking samples. We go on to derive an ideal objective function to use in this situation instead. We introduce a generalisation of adversarial training, and show how such method can interpolate between maximum likelihood training and our ideal training objective. To our knowledge this is the first theoretical analysis that explains why adversarial training tends to produce samples with higher perceived quality.
Probabilistic Segmentation via Total Variation Regularization
We present a convex approach to probabilistic segmentation and modeling of time series data. Our approach builds upon recent advances in multivariate total variation regularization, and seeks to learn a separate set of parameters for the distribution over the observations at each time point, but with an additional penalty that encourages the parameters to remain constant over time. We propose efficient optimization methods for solving the resulting (large) optimization problems, and a two-stage procedure for estimating recurring clusters under such models, based upon kernel density estimation. Finally, we show on a number of real-world segmentation tasks, the resulting methods often perform as well or better than existing latent variable models, while being substantially easier to train. 1 Introduction In this paper, we consider the tasks of time series segmentation and modeling. Formally, suppose that we observe a sequence ofT input/output pairs, represented as (x 1,y 1), (x 2,y 2),..., (x T,y T) (1) forx t R n (which can even include functions of past outputs of the time series to capture scenarios such as autoregressive models) andy t R p (though we can also consider other possible forms of the output vector, such as categorical variables).
Bayesian Analysis of Dynamic Linear Topic Models
Glynn, Chris, Tokdar, Surya T., Banks, David L., Howard, Brian
In dynamic topic modeling, the proportional contribution of a topic to a document depends on the temporal dynamics of that topic's overall prevalence in the corpus. We extend the Dynamic Topic Model of Blei and Lafferty (2006) by explicitly modeling document level topic proportions with covariates and dynamic structure that includes polynomial trends and periodicity. A Markov Chain Monte Carlo (MCMC) algorithm that utilizes Polya-Gamma data augmentation is developed for posterior inference. Conditional independencies in the model and sampling are made explicit, and our MCMC algorithm is parallelized where possible to allow for inference in large corpora. To address computational bottlenecks associated with Polya-Gamma sampling, we appeal to the Central Limit Theorem to develop a Gaussian approximation to the Polya-Gamma random variable. This approximation is fast and reliable for parameter values relevant in the text mining domain. Our model and inference algorithm are validated with multiple simulation examples, and we consider the application of modeling trends in PubMed abstracts. We demonstrate that sharing information across documents is critical for accurately estimating document-specific topic proportions. We also show that explicitly modeling polynomial and periodic behavior improves our ability to predict topic prevalence at future time points.
Bayesian group latent factor analysis with structured sparsity
Zhao, Shiwen, Gao, Chuan, Mukherjee, Sayan, Engelhardt, Barbara E
Latent factor models are the canonical statistical tool for exploratory analyses of low-dimensional linear structure for an observation matrix with p features across n samples. We develop a structured Bayesian group factor analysis model that extends the factor model to multiple coupled observation matrices; in the case of two observations, this reduces to a Bayesian model of canonical correlation analysis. The main contribution of this work is to carefully define a structured Bayesian prior that encourages both element-wise and column-wise shrinkage and leads to desirable behavior on high-dimensional data. In particular, our model puts a structured prior on the joint factor loading matrix, regularizing at three levels, which enables element-wise sparsity and unsupervised recovery of latent factors corresponding to structured variance across arbitrary subsets of the observations. In addition, our structured prior allows for both dense and sparse latent factors so that covariation among either all features or only a subset of features can both be recovered. We use fast parameter-expanded expectation-maximization for parameter estimation in this model. We validate our method on both simulated data with substantial structure and real data, comparing against a number of state-of-the-art approaches. These results illustrate useful properties of our model, including i) recovering sparse signal in the presence of dense effects; ii) the ability to scale naturally to large numbers of observations; iii) flexible observation- and factor-specific regularization to recover factors with a wide variety of sparsity levels and percentage of variance explained; and iv) tractable inference that scales to modern genomic and document data sizes.