A nonparametric HMM for genetic imputation and coalescent inference

arXiv.org Machine Learning

Genetic sequence data are well described by hidden Markov models (HMMs) in which latent states correspond to clusters of similar mutation patterns. Theory from statistical genetics suggests that these HMMs are nonhomogeneous (their transition probabilities vary along the chromosome) and have large support for self-transitions. We develop a new nonparametric model of genetic sequence data, based on the hierarchical Dirichlet process, which supports these self-transitions and nonhomogeneity. Our model provides a parameterization of the genetic process that is more parsimonious than other, more general nonparametric models that have previously been applied to population genetics. We provide truncation-free MCMC inference for our model using a new auxiliary sampling scheme for Bayesian nonparametric HMMs. In a series of experiments on male X chromosome data from the Thousand Genomes Project, and also on data simulated from a population bottleneck, we show the benefits of our model over the popular finite model fastPHASE, which can itself be seen as a parametric truncation of our model. We find that the number of HMM states found by our model is correlated with the time to the most recent common ancestor in population bottlenecks. This work demonstrates the flexibility of Bayesian nonparametrics applied to large and complex genetic data.
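As a rough illustration of the kind of prior the abstract describes, the sketch below builds a truncated, "sticky" HDP-style transition matrix with extra mass on self-transitions and samples a binary allele sequence from the resulting HMM. It is not the authors' model: the truncation level K, the concentrations gamma and alpha, the self-transition mass kappa, and the per-site Bernoulli emission model are all illustrative assumptions (in particular, the paper's inference is explicitly truncation-free).

```python
# Illustrative sketch only (not the paper's construction): a truncated
# HDP-style transition prior with boosted self-transitions, then a sampled
# latent-state path and allele sequence along the chromosome.
import numpy as np

rng = np.random.default_rng(0)

K = 20                     # truncation level (for illustration; the paper is truncation-free)
gamma, alpha = 5.0, 10.0   # hypothetical concentration parameters
kappa = 50.0               # hypothetical self-transition (stickiness) mass

# Global state weights via truncated stick-breaking (GEM distribution).
v = rng.beta(1.0, gamma, size=K)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
beta /= beta.sum()

# Each transition row is Dirichlet around beta, with extra mass kappa on the
# self-transition, encouraging persistence of haplotype clusters along the chromosome.
pi = np.vstack([
    rng.dirichlet(alpha * beta + kappa * np.eye(K)[k]) for k in range(K)
])

# Simplified per-state, per-site Bernoulli emission probabilities for a binary allele.
L = 100
theta = rng.beta(0.5, 0.5, size=(K, L))

# Sample a latent state path and observed alleles.
z = np.zeros(L, dtype=int)
x = np.zeros(L, dtype=int)
z[0] = rng.choice(K, p=beta)
x[0] = rng.random() < theta[z[0], 0]
for t in range(1, L):
    z[t] = rng.choice(K, p=pi[z[t - 1]])
    x[t] = rng.random() < theta[z[t], t]
print("states used:", len(np.unique(z)))
```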


Probabilistic Event Cascades for Alzheimer's disease

Neural Information Processing Systems

Accurate and detailed models of the progression of neurodegenerative diseases such as Alzheimer's disease (AD) are crucially important for reliable early diagnosis and for the determination and deployment of effective treatments. In this paper, we introduce the ALPACA (Alzheimer's disease Probabilistic Cascades) model, a generative model linking latent Alzheimer's progression dynamics to observable biomarker data. In contrast with previous work, which models disease progression as a fixed ordering of events, we explicitly model the variability over such orderings among patients, which is more realistic, particularly for highly detailed disease progression models. We describe efficient learning algorithms for ALPACA and discuss promising experimental results on a real cohort of Alzheimer's patients from the Alzheimer's Disease Neuroimaging Initiative.
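To make the idea of patient-level variability in event orderings concrete, here is a toy generative sketch, not the ALPACA model itself: a central ordering of biomarker events is perturbed per patient by a few adjacent swaps, a progression stage determines which events have occurred, and biomarkers are drawn from "normal" or "abnormal" distributions accordingly. The number of events, the swap-based perturbation, the stage prior, and the Gaussian emission parameters are all hypothetical.

```python
# Toy generative sketch in the spirit of event-based progression models with
# per-patient variability in the event ordering. This is NOT the ALPACA model;
# distributions, parameters, and the perturbation scheme are illustrative only.
import numpy as np

rng = np.random.default_rng(1)

N_EVENTS = 6                          # number of biomarker events (hypothetical)
central_order = np.arange(N_EVENTS)   # population-level central ordering

def sample_patient_order(central, n_swaps=2):
    """Perturb the central ordering with a few random adjacent swaps."""
    order = central.copy()
    for _ in range(n_swaps):
        i = rng.integers(0, len(order) - 1)
        order[i], order[i + 1] = order[i + 1], order[i]
    return order

def sample_patient(central):
    order = sample_patient_order(central)
    stage = rng.integers(0, N_EVENTS + 1)     # how many events have already occurred
    occurred = np.zeros(N_EVENTS, dtype=bool)
    occurred[order[:stage]] = True
    # Biomarker values: normal ~ N(0, 1), abnormal ~ N(2, 1) (illustrative choices).
    x = np.where(occurred, rng.normal(2.0, 1.0, N_EVENTS),
                 rng.normal(0.0, 1.0, N_EVENTS))
    return order, stage, x

order, stage, biomarkers = sample_patient(central_order)
print("patient ordering:", order, "stage:", stage)
print("biomarkers:", np.round(biomarkers, 2))
```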


Neural Networks for Determining Protein Specificity and Multiple Alignment of Binding Sites

AAAI Conferences

Regulation of gene expression often involves proteins that bind to particular regions of DNA. Determining the binding sites for a protein and its specificity usually requires extensive biochemical and/or genetic experimentation. In this paper we illustrate the use of a neural network to obtain the desired information with much less experimental effort. It is often fairly easy to obtain a set of moderate-length sequences, perhaps one or two hundred base pairs each, that each contain binding sites for the protein being studied. For example, the upstream regions of a set of genes that are all regulated by the same protein should each contain binding sites for that protein.
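A minimal sketch of the sliding-window idea, assuming a single linear layer over a one-hot DNA encoding rather than the network actually used in the paper: each fixed-width window of a sequence is scored, and the best-scoring window is taken as a candidate binding site. The site width, the random placeholder weights, and the synthetic sequence are illustrative only; in practice the weights would be learned from sequences known to contain sites.

```python
# Minimal illustrative sketch (not the paper's network): score every window of
# a DNA sequence with a single-layer network over a one-hot encoding, so the
# highest-scoring window becomes a candidate binding site.
import numpy as np

rng = np.random.default_rng(2)
BASES = "ACGT"
SITE_LEN = 8   # hypothetical binding-site width

def one_hot(seq):
    return np.eye(4)[[BASES.index(b) for b in seq]]   # shape (len(seq), 4)

# Placeholder weight matrix: one weight per (position, base) in the window.
# In a real setting these would be trained, not random.
W = rng.normal(size=(SITE_LEN, 4))

def window_scores(seq):
    """Score each length-SITE_LEN window of seq with the linear network."""
    x = one_hot(seq)
    return np.array([np.sum(W * x[i:i + SITE_LEN])
                     for i in range(len(seq) - SITE_LEN + 1)])

seq = "".join(rng.choice(list(BASES), size=120))   # stand-in upstream region
scores = window_scores(seq)
best = int(np.argmax(scores))
print("best candidate site:", seq[best:best + SITE_LEN], "at position", best)
```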


Constraining the Dynamics of Deep Probabilistic Models

arXiv.org Machine Learning

We introduce a novel generative formulation of deep probabilistic models implementing "soft" constraints on their function dynamics. In particular, we develop a flexible methodological framework where the modeled functions and derivatives of a given order are subject to inequality or equality constraints. We then characterize the posterior distribution over model and constraint parameters through stochastic variational inference. As a result, the proposed approach allows for accurate and scalable uncertainty quantification on the predictions and on all parameters. We demonstrate the application of equality constraints in the challenging problem of parameter inference in ordinary differential equation models, while we showcase the application of inequality constraints on the problem of monotonic regression of count data. The proposed approach is extensively tested in several experimental settings, leading to highly competitive results in challenging modeling applications, while offering high expressiveness, flexibility and scalability.
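As a loose illustration of a "soft" constraint on function dynamics, and not the paper's stochastic variational formulation, the sketch below fits a cubic log-rate to synthetic count data under a Poisson likelihood and adds a hinge penalty on negative derivatives at a grid of collocation points, encouraging monotonicity. The polynomial basis, the penalty weight LAM, the collocation grid, and the use of scipy's L-BFGS-B optimizer are all assumptions made for the example.

```python
# Sketch of a "soft" inequality constraint on a model's derivative: a Poisson
# regression with a hinge penalty that discourages negative slopes of the
# log-rate at collocation points. Illustrative only; not the paper's method.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 40)
y = rng.poisson(np.exp(1.0 + 2.0 * x))     # synthetic, roughly monotone count data

grid = np.linspace(0.0, 1.0, 100)          # collocation points for the constraint
LAM = 100.0                                # soft-constraint weight (hypothetical)

def design(t):
    return np.vander(t, 4, increasing=True)          # basis [1, t, t^2, t^3]

def d_design(t):
    return np.column_stack([np.zeros_like(t), np.ones_like(t), 2 * t, 3 * t ** 2])

def objective(w):
    f = design(x) @ w                                 # log-rate at the data points
    nll = np.sum(np.exp(f) - y * f)                   # negative Poisson log-likelihood
    df = d_design(grid) @ w                           # derivative of the log-rate on the grid
    penalty = LAM * np.sum(np.maximum(0.0, -df) ** 2)  # hinge on negative slopes
    return nll + penalty

w_hat = minimize(objective, np.zeros(4), method="L-BFGS-B").x
print("fitted coefficients:", np.round(w_hat, 3))
```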


RNA Modeling Using Gibbs Sampling and Stochastic Context Free Grammars

AAAI Conferences

Leslie Grate, Mark Herbster, Richard Hughey, and David Haussler (Baskin Center for Computer Engineering and Computer and Information Sciences, University of California, Santa Cruz, CA 95064); I. Saira Mian and Harry Noller (Sinsheimer Laboratories, University of California, Santa Cruz, CA 95064)

Keywords: RNA secondary structure, Gibbs sampler, Expectation Maximization, stochastic context-free grammars, hidden Markov models, tRNA, snRNA, 16S rRNA, linguistic methods

Abstract: A new method of discovering the common secondary structure of a family of homologous RNA sequences using Gibbs sampling and stochastic context-free grammars is proposed. The Gibbs sampling step yields parameters that describe a statistical model of the family. After the Gibbs sampling has produced a crude statistical model for the family, this model is translated into a stochastic context-free grammar, which is then refined by an Expectation Maximization (EM) procedure to produce a more complete model. A prototype implementation of the method is tested on tRNA, pieces of 16S rRNA, and U5 snRNA with good results.

Introduction: Tools for analyzing RNA are becoming increasingly important as in vitro evolution and selection techniques produce greater numbers of synthesized RNA families to supplement those related by phylogeny. Two principal methods have been established for predicting RNA secondary structure base pairings; one of these employs thermodynamics to compare the free energy changes predicted for the formation of possible secondary structures and relies on finding the structure with the lowest free energy (Tinoco Jr., Uhlenbeck, & Levine 1971; Turner, Sugimoto, & Freier 1988). When several related sequences are available that all share a common secondary structure, combinations of different approaches have been used to obtain improved results (Waterman 1989; Le & Zuker 1991; Han & Kim 1993; Chiu & Kolodziejczak 1991; Sankoff 1985; Winker et al. 1990; Lapedes 1992; Klinger & Brutlag 1993; Gutell et al. 1992). Recent efforts have applied Stochastic Context-Free Grammars (SCFGs) to the problems of statistical modeling, multiple alignment, discrimination, and prediction of the secondary structure of RNA families (Sakakibara et al. 1994; 1993; Eddy & Durbin 1994; Searls 1993).
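As a rough sketch of what a Gibbs-sampling step for a shared RNA element might look like, and not the paper's actual procedure, the example below samples the location of a common fixed-width motif in each sequence of a toy family and re-estimates position-specific base frequencies from the remaining sequences. Such frequencies play the role of the crude statistical model that the abstract says is subsequently translated into an SCFG and refined by EM; the motif width W, the planted motif GGAUCC, and the synthetic sequences are all hypothetical.

```python
# Illustrative Gibbs-sampling sketch (not the paper's exact procedure): sample
# the location of a shared length-W element in each RNA sequence of a toy
# family, scoring candidate positions under position-specific base frequencies
# estimated from the other sequences.
import numpy as np

rng = np.random.default_rng(4)
BASES = "ACGU"
W = 6                                  # motif width (hypothetical)

def make_seq(length=40):
    """Random RNA sequence with the toy motif GGAUCC planted at a random offset."""
    s = list(rng.choice(list(BASES), size=length))
    start = rng.integers(0, length - W)
    s[start:start + W] = list("GGAUCC")
    return "".join(s)

seqs = [make_seq() for _ in range(20)]
pos = [rng.integers(0, len(s) - W + 1) for s in seqs]   # initial motif positions

def freqs_excluding(skip):
    """Position-specific base frequencies from all current motifs except sequence `skip`."""
    c = np.ones((W, 4))                                  # add-one pseudocounts
    for j, s in enumerate(seqs):
        if j == skip:
            continue
        for k in range(W):
            c[k, BASES.index(s[pos[j] + k])] += 1
    return c / c.sum(axis=1, keepdims=True)

for sweep in range(50):
    for i, s in enumerate(seqs):
        p = freqs_excluding(i)
        # Score every candidate start position under the current motif model,
        # then resample this sequence's motif position from those scores.
        scores = np.array([
            np.prod([p[k, BASES.index(s[a + k])] for k in range(W)])
            for a in range(len(s) - W + 1)
        ])
        pos[i] = rng.choice(len(scores), p=scores / scores.sum())

print("recovered motifs:", {seqs[i][pos[i]:pos[i] + W] for i in range(5)})
```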