Undirected Networks
Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
Allahverdyan, Armen E., Galstyan, Aram
We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam's razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters.
High-Dimensional Covariance Decomposition into Sparse Markov and Independence Models
Janzamin, Majid, Anandkumar, Animashree
Fitting high-dimensional data involves a delicate tradeoff between faithful representation and the use of sparse models. Too often, sparsity assumptions on the fitted model are too restrictive to provide a faithful representation of the observed data. In this paper, we present a novel framework incorporating sparsity in different domains.We decompose the observed covariance matrix into a sparse Gaussian Markov model (with a sparse precision matrix) and a sparse independence model (with a sparse covariance matrix). Our framework incorporates sparse covariance and sparse precision estimation as special cases and thus introduces a richer class of high-dimensional models. We characterize sufficient conditions for identifiability of the two models, \viz Markov and independence models. We propose an efficient decomposition method based on a modification of the popular $\ell_1$-penalized maximum-likelihood estimator ($\ell_1$-MLE). We establish that our estimator is consistent in both the domains, i.e., it successfully recovers the supports of both Markov and independence models, when the number of samples $n$ scales as $n = \Omega(d^2 \log p)$, where $p$ is the number of variables and $d$ is the maximum node degree in the Markov model. Our experiments validate these results and also demonstrate that our models have better inference accuracy under simple algorithms such as loopy belief propagation.
Bayesian Structural Inference for Hidden Processes
Strelioff, Christopher C., Crutchfield, James P.
We introduce a Bayesian approach to discovering patterns in structurally complex processes. The proposed method of Bayesian Structural Inference (BSI) relies on a set of candidate unifilar HMM (uHMM) topologies for inference of process structure from a data series. We employ a recently developed exact enumeration of topological epsilon-machines. (A sequel then removes the topological restriction.) This subset of the uHMM topologies has the added benefit that inferred models are guaranteed to be epsilon-machines, irrespective of estimated transition probabilities. Properties of epsilon-machines and uHMMs allow for the derivation of analytic expressions for estimating transition probabilities, inferring start states, and comparing the posterior probability of candidate model topologies, despite process internal structure being only indirectly present in data. We demonstrate BSI's effectiveness in estimating a process's randomness, as reflected by the Shannon entropy rate, and its structure, as quantified by the statistical complexity. We also compare using the posterior distribution over candidate models and the single, maximum a posteriori model for point estimation and show that the former more accurately reflects uncertainty in estimated values. We apply BSI to in-class examples of finite- and infinite-order Markov processes, as well to an out-of-class, infinite-state hidden process.
Scalable and Efficient Bayes-Adaptive Reinforcement Learning Based on Monte-Carlo Tree Search
Guez, A., Silver, D., Dayan, P.
Bayesian planning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, planning optimally in the face of uncertainty is notoriously taxing, since the search space is enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach avoids expensive applications of Bayes rule within the search tree by sampling models from current beliefs, and furthermore performs this sampling in a lazy manner. This enables it to outperform previous Bayesian model-based reinforcement learning algorithms by a significant margin on several well-known benchmark problems. As we show, our approach can even work in problems with an infinite state space that lie qualitatively out of reach of almost all previous work in Bayesian exploration.
Single Network Relational Transductive Learning
Relational classification on a single connected network has been of particular interest in the machine learning and data mining communities in the last decade or so. This is mainly due to the explosion in popularity of social networking sites such as Facebook, LinkedIn and Google+ amongst others. In statistical relational learning, many techniques have been developed to address this problem, where we have a connected unweighted homogeneous/heterogeneous graph that is partially labeled and the goal is to propagate the labels to the unlabeled nodes. In this paper, we provide a different perspective by enabling the effective use of graph transduction techniques for this problem. We thus exploit the strengths of this class of methods for relational learning problems. We accomplish this by providing a simple procedure for constructing a weight matrix that serves as input to a rich class of graph transduction techniques. Our procedure has multiple desirable properties. For example, the weights it assigns to edges between unlabeled nodes naturally relate to a measure of association commonly used in statistics, namely the Gamma test statistic. We further portray the efficacy of our approach on synthetic as well as real data, by comparing it with state-of-the-art relational learning algorithms, and graph transduction techniques with an adjacency matrix or a real valued weight matrix computed using available attributes as input. In these experiments we see that our approach consistently outperforms other approaches when the graph is sparsely labeled, and remains competitive with the best when the proportion of known labels increases.
Viterbi training in PRISM
Sato, Taisuke, Kubota, Keiichi
VT (Viterbi training), or hard EM, is an efficient way of parameter learning for probabilistic models with hidden variables. Given an observation $y$, it searches for a state of hidden variables $x$ that maximizes $p(x,y \mid \theta)$ by coordinate ascent on parameters $\theta$ and $x$. In this paper we introduce VT to PRISM, a logic-based probabilistic modeling system for generative models. VT improves PRISM in three ways. First VT in PRISM converges faster than EM in PRISM due to the VT's termination condition. Second, parameters learned by VT often show good prediction performance compared to those learned by EM. We conducted two parsing experiments with probabilistic grammars while learning parameters by a variety of inference methods, i.e.\ VT, EM, MAP and VB. The result is that VT achieved the best parsing accuracy among them in both experiments. Also we conducted a similar experiment for classification tasks where a hidden variable is not a prediction target unlike probabilistic grammars. We found that in such a case VT does not necessarily yield superior performance. Third since VT always deals with a single probability of a single explanation, Viterbi explanation, the exclusiveness condition that is imposed on PRISM programs is no more required if we learn parameters by VT. Last but not least we can say that as VT in PRISM is general and applicable to any PRISM program, it largely reduces the need for the user to develop a specific VT algorithm for a specific model. Furthermore since VT in PRISM can be used just by setting a PRISM flag appropriately, it makes VT easily accessible to (probabilistic) logic programmers. To appear in Theory and Practice of Logic Programming (TPLP).
A survey on independence-based Markov networks learning
Name Reference Comments KS Koller and Sahami (1996) - Not Sound - The first one of this type - Requires specifying MB size in advance GS Margaritis and Thrun (2000) - Sound in theory - Proposed to learn Bayesian network via the induction of neighbors of each variable - First proved such kind of algorithm - Works in two phases: grow and shrink IAMB and its variants Tsamardinos et al (2003) - Sound in theory - Actually variant of GS - Simple to implement - Time efficient - Very poor on data efficiency - IAMB's variants achieve better performance on data efficiency than IAMB HITON-PC/MB Aliferis et al (2003) - Not sound - Another trial to make use of the topology information to enhance data efficiency - Data efficiency comparable to IAMB - Much slower compared to IAMB Fast-IAMB Yaramakala and Margaritis (2005) - Sound in theory - No fundamental difference as compared to IAMB - Adds candidates more greedily to speed up the learning - Still poor on data efficiency performance MMPC/MB Tsamardinos et al (2006) - Not sound - The first to make use of the underling topology information - Much more data efficient compared to IAMB - Much slower compared to IAMB PCMB Peรฑa et al (2007) - Sound in theory - Data efficient by making use of topology information - Poor on time efficiency - Distinguish spouses from parents/children - Distinguish some children from parents/children IPC-MB Fu and Desmarais (2008) - Sound in theory - Most data efficient compared with previous algorithms - Much faster than PCMB on computing - Distinguish spouses from parents/children - Distinguish some children from parents/children - Best tradeoff among this family of algorithms
AI Methods in Algorithmic Composition: A Comprehensive Survey
Algorithmic composition is the partial or total automation of the process of music composition by using computers. Since the 1950s, different computational techniques related to Artificial Intelligence have been used for algorithmic composition, including grammatical representations, probabilistic methods, neural networks, symbolic rule-based systems, constraint programming and evolutionary algorithms. This survey aims to be a comprehensive account of research on algorithmic composition, presenting a thorough view of the field for researchers in Artificial Intelligence.
Integrating Declarative Programming and Probabilistic Planning for Robots
Zhang, Shiqi (Texas Tech University) | Sridharan, Mohan (Texas Tech University)
Mobile robots deployed in complex real-world domains typically find it difficult to process all sensor inputs or operate without substantial domain knowledge. At the same time, humans may not have the time and expertise to provide elaborateand accurate knowledge or feedback. The architecture described in this paper combines declarative programming and probabilistic sequential decision-making to address these challenges. Specifically, Answer Set Programming (ASP), a declarative programming paradigm, is combined with hierarchical partially observable Markov decision processes (POMDPs), enabling robots to: (a) represent and reason with incomplete domain knowledge, revising existing knowledge using information extracted from sensor inputs; (b) probabilistically model the uncertainty in sensor input processing and navigation; and (c) use domain knowledge to revise probabilistic beliefs, exploiting positive and negative observations to identify situations in which the assigned task can no longer be pursued. All algorithms are evaluated in simulation and on mobile robots locating target objects in indoor domains.
Combining a POMDP Abstraction with Replanning to Solve Complex, Position-Dependent Sensing Tasks
Grady, Devin (Rice University) | Moll, Mark (Rice University) | Kavraki, Lydia E. (Rice University)
The Partially-Observable Markov Decision Process (POMDP) is a general framework to determine reward-maximizing action policies under noisy action and sensing conditions. However, determining an optimal policy for POMDPs is often intractable for robotic tasks due to the PSPACE-complete nature of the computation required. Several recent solvers have been introduced that expand the size of problems that can be considered. Although these POMDP solvers can respect complex motion constraints in theory, we show that the computational cost does not provide a benefit in the eventual online execution, compared to our alternative approach that relies on a policy that ignores some of the motion constraints. We advocate using the POMDP framework where it is critical -- to find a policy that provides the optimal action given all past noisy sensor observations, while abstracting some of the motion constraints to reduce solution time. However, the actions of an abstract robot are generally not executable under its true motion constraints. The problem is addressed offline with a less-constrained POMDP, and navigation under the full system constraints is handled online with replanning. We empirically demonstrate that the policy generated using this abstracted motion model is faster to compute and achieves similar or higher reward than addressing the motion constraints for a car-like robot as used in our experiments directly in the POMDP.