The Hierarchical Dirichlet Process Hidden Semi-Markov Model
Johnson, Matthew J., Willsky, Alan
There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the traditional HMM. However, in many settings the HDP-HMM's strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markovianity, which has been developed in the parametric setting to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration HDP-HSMM and develop posterior sampling algorithms for efficient inference in both the direct-assignment and weak-limit approximation settings. We demonstrate the utility of the model and our inference methods on synthetic data as well as experiments on a speaker diarization problem and an example of learning the patterns in Morse code.
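To make the contrast with geometric durations concrete, here is a minimal sketch of the generative side of a finite, parametric explicit-duration HSMM. It is not the HDP-HSMM or its samplers. The function names, the two-state example, and the tabulated duration distributions are illustrative assumptions, not taken from the paper.

```python
import random
from itertools import groupby


def sample_hsmm(T, trans, durations, rng):
    """Sample a length-T state sequence from a toy explicit-duration HSMM.

    trans[s]     : dict mapping next state -> probability (no self-transitions)
    durations[s] : dict mapping segment length -> probability (any shape,
                   not restricted to a geometric distribution)
    Both tables are hypothetical inputs chosen for illustration.
    """
    states = []
    s = rng.choice(sorted(trans))  # arbitrary uniform initial state
    while len(states) < T:
        # Draw how long the chain dwells in state s, then emit that segment.
        d_vals, d_probs = zip(*sorted(durations[s].items()))
        d = rng.choices(d_vals, weights=d_probs)[0]
        states.extend([s] * d)
        # Move to a different state; the duration is redrawn on entry.
        n_vals, n_probs = zip(*sorted(trans[s].items()))
        s = rng.choices(n_vals, weights=n_probs)[0]
    return states[:T]  # the final segment may be truncated at T


def run_lengths(seq):
    """Collapse a sequence into (state, run length) pairs."""
    return [(k, len(list(g))) for k, g in groupby(seq)]


# Assumed toy model: state "A" lasts 3 or 4 steps, state "B" exactly 1 step.
trans = {"A": {"B": 1.0}, "B": {"A": 1.0}}
durations = {"A": {3: 0.5, 4: 0.5}, "B": {1: 1.0}}
seq = sample_hsmm(50, trans, durations, random.Random(0))
runs = run_lengths(seq)
```

In an ordinary HMM the dwell time in a state is implied by its self-transition probability and is therefore geometric. Here every complete run of "A" has length 3 or 4 by construction, which no self-transition probability can reproduce.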
Invariant Gaussian Process Latent Variable Models and Application in Causal Discovery
Zhang, Kun, Schoelkopf, Bernhard, Janzing, Dominik
In nonlinear latent variable models or dynamic models, if we consider the latent variables as confounders (common causes), the noise dependencies imply further relations between the observed variables. Such models are then closely related to causal discovery in the presence of nonlinear confounders, which is a challenging problem. However, generally in such models the observation noise is assumed to be independent across data dimensions, and consequently the noise dependencies are ignored. In this paper we focus on the Gaussian process latent variable model (GPLVM), from which we develop an extended model called invariant GPLVM (IGPLVM), which can adapt to arbitrary noise covariances. With the Gaussian process prior put on a particular transformation of the latent nonlinear functions, instead of the original ones, the algorithm for IGPLVM involves almost the same computational load as that for the original GPLVM. Besides its potential application in causal discovery, IGPLVM has the advantage that its estimated latent nonlinear manifold is invariant to any nonsingular linear transformation of the data. Experimental results on both synthetic and real-world data show its encouraging performance in nonlinear manifold learning and causal discovery.
Combining Spatial and Telemetric Features for Learning Animal Movement Models
Kapicioglu, Berk, Schapire, Robert E., Wikelski, Martin, Broderick, Tamara
We introduce a new graphical model for tracking radio-tagged animals and learning their movement patterns. The model provides a principled way to combine radio telemetry data with an arbitrary set of user-defined spatial features. We describe an efficient stochastic gradient algorithm for fitting model parameters to data and demonstrate its effectiveness via asymptotic analysis and synthetic experiments. We also apply our model to real datasets, and show that it outperforms the most popular radio telemetry software package used in ecology. We conclude that integration of different data sources under a single statistical framework, coupled with appropriate parameter and state estimation procedures, produces both accurate location estimates and an interpretable statistical model of animal movement.
Algorithms and Complexity Results for Exact Bayesian Structure Learning
Ordyniak, Sebastian, Szeider, Stefan
Bayesian structure learning is the NP-hard problem of discovering a Bayesian network that optimally represents a given set of training data. In this paper we study the computational worst-case complexity of exact Bayesian structure learning under graph-theoretic restrictions on the super-structure. The super-structure (a concept introduced by Perrier, Imoto, and Miyano, JMLR 2008) is an undirected graph that contains as subgraphs the skeletons of solution networks. Our results apply to several variants of score-based Bayesian structure learning where the score of a network decomposes into local scores of its nodes. Results: We show that exact Bayesian structure learning can be carried out in non-uniform polynomial time if the super-structure has bounded treewidth and in linear time if in addition the super-structure has bounded maximum degree. We complement this with a number of hardness results. We show that both restrictions (treewidth and degree) are essential and cannot be dropped without losing uniform polynomial time tractability (subject to a complexity-theoretic assumption). Furthermore, we show that the restrictions remain essential if we do not search for a globally optimal network but we aim to improve a given network by means of at most k arc additions, arc deletions, or arc reversals (k-neighborhood local search).
Learning, Social Intelligence and the Turing Test - why an "out-of-the-box" Turing Machine will not pass the Turing Test
Edmonds, Bruce, Gershenson, Carlos
The Turing Test (TT) checks for human intelligence, rather than any putative general intelligence. It involves repeated interaction requiring learning in the form of adaption to the human conversation partner. It is a macro-level post-hoc test in contrast to the definition of a Turing Machine (TM), which is a prior micro-level definition. This raises the question of whether learning is just another computational process, i.e. can be implemented as a TM. Here we argue that learning or adaption is fundamentally different from computation, though it does involve processes that can be seen as computations. To illustrate this difference we compare (a) designing a TM and (b) learning a TM, defining them for the purpose of the argument. We show that there is a well-defined sequence of problems which are not effectively designable but are learnable, in the form of the bounded halting problem. Some characteristics of human intelligence are reviewed, including its interactive nature, learning abilities, imitative tendencies, linguistic ability and context-dependency. A story that explains some of these is the Social Intelligence Hypothesis. If this is broadly correct, this points to the necessity of a considerable period of acculturation (social learning in context) if an artificial intelligence is to pass the TT. Whilst it is always possible to 'compile' the results of learning into a TM, this would not be a designed TM and would not be able to continually adapt (pass future TTs). We conclude three things, namely that: a purely "designed" TM will never pass the TT; that there is no such thing as a general intelligence since it necessarily involves learning; and that learning/adaption and computation should be clearly distinguished.
Inference by Minimizing Size, Divergence, or their Sum
Riedel, Sebastian, Smith, David A., McCallum, Andrew
We speed up marginal inference by ignoring factors that do not significantly contribute to overall accuracy. In order to pick a suitable subset of factors to ignore, we propose three schemes: minimizing the number of model factors under a bound on the KL divergence between pruned and full models; minimizing the KL divergence under a bound on factor count; and minimizing the weighted sum of KL divergence and factor count. All three problems are solved using an approximation of the KL divergence that can be calculated in terms of marginals computed on a simple seed graph. Applied to synthetic image denoising and to three different types of NLP parsing models, this technique performs marginal inference up to 11 times faster than loopy BP, with graph sizes reduced up to 98%, at comparable error in marginals and parsing accuracy. We also show that minimizing the weighted sum of divergence and size is substantially faster than minimizing either of the other objectives based on the approximation to divergence presented here.
A Proximal-Gradient Homotopy Method for the Sparse Least-Squares Problem
We consider solving the $\ell_1$-regularized least-squares ($\ell_1$-LS) problem in the context of sparse recovery, for applications such as compressed sensing. The standard proximal gradient method, also known as iterative soft-thresholding when applied to this problem, has low computational cost per iteration but a rather slow convergence rate. Nevertheless, when the solution is sparse, it often exhibits fast linear convergence in the final stage. We exploit the local linear convergence using a homotopy continuation strategy, i.e., we solve the $\ell_1$-LS problem for a sequence of decreasing values of the regularization parameter, and use an approximate solution at the end of each stage to warm start the next stage. Although similar strategies have been studied in the literature, there has been no theoretical analysis of their global iteration complexity. This paper shows that under suitable assumptions for sparse recovery, the proposed homotopy strategy ensures that all iterates along the homotopy solution path are sparse. Therefore the objective function is effectively strongly convex along the solution path, and geometric convergence at each stage can be established. As a result, the overall iteration complexity of our method is $O(\log(1/\epsilon))$ for finding an $\epsilon$-optimal solution, which can be interpreted as a global geometric rate of convergence. We also present empirical results to support our theoretical analysis.
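The warm-started continuation idea described above can be sketched in a few lines of pure Python. This is a simplified illustration, not the paper's algorithm: it assumes the objective $\frac{1}{2}\|Ax-b\|_2^2 + \lambda\|x\|_1$, a fixed step size no larger than $1/\|A\|_2^2$, a fixed iteration budget per stage, and a hypothetical shrink factor `eta`, where the paper's method would use its own step-size and stopping rules. All function and parameter names are assumptions.

```python
def matvec(A, x):
    """Compute A x for a dense matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied elementwise."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def ista(A, b, lam, x0, step, iters):
    """Iterative soft-thresholding for 0.5*||Ax - b||^2 + lam*||x||_1."""
    x = list(x0)
    for _ in range(iters):
        residual = [ri - bi for ri, bi in zip(matvec(A, x), b)]
        grad = rmatvec(A, residual)  # gradient of the smooth part
        x = soft_threshold([xi - step * gi for xi, gi in zip(x, grad)],
                           step * lam)
    return x


def homotopy_ista(A, b, lam_target, step, eta=0.5, iters_per_stage=50):
    """Solve a sequence of problems with decreasing lam, warm-starting each."""
    # For lam >= ||A^T b||_inf the all-zero vector is optimal, so start there.
    lam = max(abs(v) for v in rmatvec(A, b))
    x = [0.0] * len(A[0])
    while lam > lam_target:
        lam = max(eta * lam, lam_target)  # shrink, but never below the target
        x = ista(A, b, lam, x, step, iters_per_stage)  # warm start from x
    return x


# Toy usage with an identity design matrix (assumed data for illustration).
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [3.0, 0.5, -2.0]
x_hat = homotopy_ista(A, b, lam_target=1.0, step=1.0)
```

With an identity matrix the minimizer at the target value is simply the soft-thresholded `b`, which makes the sketch easy to check by hand. The intermediate stages exist so that each warm start is already sparse and close to the next solution.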
Combining Voting Rules Together
Narodytska, Nina, Walsh, Toby, Xia, Lirong
We propose a simple method for combining together voting rules that performs a run-off between the different winners of each voting rule. We prove that this combinator has several good properties. For instance, even if just one of the base voting rules has a desirable property like Condorcet consistency, the combination inherits this property. In addition, we prove that combining voting rules together in this way can make finding a manipulation more computationally difficult. Finally, we study the impact of this combinator on approximation methods that find close to optimal manipulations.
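A small sketch may help fix the idea of a run-off between base-rule winners. The abstract does not specify how the run-off is decided or how ties are broken, so the code below makes assumptions: the run-off is a plurality vote on the profile restricted to the base winners, ties go to the lexicographically smallest candidate, and the base rules are plurality and Borda. All function names are hypothetical.

```python
from collections import Counter


def plurality(profile, candidates):
    """Winner by first-place votes among `candidates` (ties: smallest name)."""
    counts = Counter()
    for ranking in profile:
        top = next(c for c in ranking if c in candidates)
        counts[top] += 1
    return min(candidates, key=lambda c: (-counts[c], c))


def borda(profile, candidates):
    """Winner by Borda score among `candidates` (ties: smallest name)."""
    scores = Counter()
    for ranking in profile:
        restricted = [c for c in ranking if c in candidates]
        for pos, c in enumerate(restricted):
            scores[c] += len(restricted) - 1 - pos
    return min(candidates, key=lambda c: (-scores[c], c))


def combine(rules, profile, candidates):
    """Run every base rule, then hold a run-off among their winners."""
    finalists = {rule(profile, candidates) for rule in rules}
    if len(finalists) == 1:
        return next(iter(finalists))
    # Assumed run-off: plurality on the profile restricted to the finalists.
    return plurality(profile, finalists)


# Each ranking lists candidates from most to least preferred.
profile = (
    [["a", "b", "c"]] * 3
    + [["b", "c", "a"]] * 2
    + [["c", "b", "a"]] * 2
)
candidates = {"a", "b", "c"}
winner = combine([plurality, borda], profile, candidates)
```

In this profile plurality alone elects "a" and Borda elects "b"; the run-off between them goes to "b", which also beats every other candidate head to head.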
Generalized Beta Mixtures of Gaussians
Armagan, Artin, Dunson, David B., Clyde, Merlise
In recent years, a rich variety of shrinkage priors have been proposed that have great promise in addressing massive regression problems. In general, these new priors can be expressed as scale mixtures of normals, but have more complex forms and better properties than traditional Cauchy and double exponential priors. We first propose a new class of normal scale mixtures through a novel generalized beta distribution that encompasses many interesting priors as special cases. This encompassing framework should prove useful in comparing competing priors, considering properties and revealing close connections. We then develop a class of variational Bayes approximations through the new hierarchy presented that will scale more efficiently to the types of truly massive data sets that are now encountered routinely.
A Generalized Least Squares Matrix Decomposition
Allen, Genevera I., Grosenick, Logan, Taylor, Jonathan
Variables in many massive high-dimensional data sets are structured, arising for example from measurements on a regular grid as in imaging and time series or from spatial-temporal measurements as in climate studies. Classical multivariate techniques ignore these structural relationships, often resulting in poor performance. We propose a generalization of the singular value decomposition (SVD) and principal components analysis (PCA) that is appropriate for massive data sets with structured variables or known two-way dependencies. By finding the best low rank approximation of the data with respect to a transposable quadratic norm, our decomposition, entitled the Generalized least squares Matrix Decomposition (GMD), directly accounts for structural relationships. As many variables in high-dimensional settings are often irrelevant or noisy, we also regularize our matrix decomposition by adding two-way penalties to encourage sparsity or smoothness. We develop fast computational algorithms using our methods to perform generalized PCA (GPCA), sparse GPCA, and functional GPCA on massive data sets. Through simulations and a whole brain functional MRI example we demonstrate the utility of our methodology for dimension reduction, signal recovery, and feature selection with high-dimensional structured data.