Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
Jojic, Nebojsa, Jojic, Vladimir, Meek, Christopher, Heckerman, David, Frey, Brendan J.
We introduce a new model of genetic diversity that summarizes a large input dataset into an epitome: a short sequence, or a small set of short sequences, of probability distributions capturing many overlapping subsequences from the dataset. The epitome as a representation has already been used in modeling real-valued signals, such as images and audio. The discrete sequence model we introduce in this paper targets applications in genetics, from multiple alignment to recombination and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV, where the epitome emerges as a natural model for producing relatively small vaccines covering a large number of immune system targets known as epitopes. Our experiments show that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take into account uncertainty about T-cell cross reactivity and epitope presentation. In our experiments, we find that vaccine optimization is fairly robust to these uncertainties.
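As a rough illustration of the coverage criterion, one can count how many distinct k-mer epitopes observed across a set of strains also occur in a candidate vaccine sequence. The `kmers` and `coverage` helpers below are hypothetical; the paper's design optimizes probabilistic epitomes rather than scoring a single fixed sequence:

```python
# Hypothetical sketch: score a candidate vaccine sequence by how many
# distinct length-k epitopes from the observed strains it covers.
def kmers(seq, k=9):
    """All length-k subsequences of seq (epitopes are often 9-mers)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def coverage(vaccine, strains, k=9):
    """Return (covered, total): epitopes of the strains found in the vaccine."""
    observed = set().union(*(kmers(s, k) for s in strains))
    return len(kmers(vaccine, k) & observed), len(observed)
```

A longer vaccine sequence covers more epitopes; the design problem is maximizing this coverage at a fixed, small length.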
Probabilistic Inference of Alternative Splicing Events in Microarray Data
Shai, Ofer, Frey, Brendan J., Morris, Quaid D., Pan, Qun, Misquitta, Christine, Blencowe, Benjamin J.
Alternative splicing (AS) is an important and frequent step in mammalian gene expression that allows a single gene to specify multiple products, and is crucial for the regulation of fundamental biological processes. The extent of AS regulation, and the mechanisms involved, are not well understood. We have developed a custom DNA microarray platform for surveying AS levels on a large scale. We present here a generative model for the AS Array Platform (GenASAP) and demonstrate its utility for quantifying AS levels in different mouse tissues. Learning is performed using a variational expectation maximization algorithm, and the parameters are shown to correctly capture expected AS trends. A comparison of the results obtained with a well-established but low-throughput experimental method demonstrates that AS levels obtained from GenASAP are highly predictive of AS levels in mammalian tissues.
Probabilistic Inference of Speech Signals from Phaseless Spectrograms
Achan, Kannan, Roweis, Sam T., Frey, Brendan J.
Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram.
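As a rough illustration of recovering a time-domain signal from a magnitude-only spectrogram, here is the classic Griffin–Lim iteration using `scipy.signal` — an illustrative stand-in for the general problem, not the MAP optimizer over a generative speech model described in the abstract:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=50, nperseg=256):
    """Estimate a time-domain signal whose STFT magnitude matches `mag`.

    Alternates between inverting the current complex spectrogram and
    replacing its magnitude with the target, keeping only the phase.
    """
    rng = np.random.default_rng(0)
    phase = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)   # invert current guess
        _, _, Z = stft(x, nperseg=nperseg)           # re-analyze the signal
        phase = np.exp(1j * np.angle(Z))             # keep phase, reset magnitude
    _, x = istft(mag * phase, nperseg=nperseg)
    return x
```

This alternating projection has no explicit prior over speech signals, which is exactly the gap a generative time-domain model can fill.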
Denoising and Untangling Graphs Using Degree Priors
Morris, Quaid D., Frey, Brendan J.
This paper addresses the problem of untangling hidden graphs from a set of noisy detections of undirected edges. We present a model of the generation of the observed graph that includes degree-based structure priors on the hidden graphs. Exact inference in the model is intractable; we present an efficient approximate inference algorithm to compute edge appearance posteriors. We evaluate our model and algorithm on a biological graph inference problem.
Fast, Large-Scale Transformation-Invariant Clustering
Frey, Brendan J., Jojic, Nebojsa
In previous work on "transformed mixtures of Gaussians" and "transformed hidden Markov models", we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center images, pitch-normalize spectrograms) and learn a mixture model of the normalized data. The only input to the algorithm is the data, a list of possible transformations, and the number of clusters to find. The main criticism of this work was that the exhaustive computation of the posterior probabilities over transformations would make scaling up to large feature vectors and large sets of transformations intractable. Here, we describe how a tremendous speedup is achieved through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities.
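The fast Fourier transform trick can be sketched in its simplest setting: the posterior over all N circular shifts of an isotropic Gaussian mean, computed in O(N log N) rather than the O(N²) of exhaustive evaluation. The helper below is a hypothetical 1-D sketch; the papers handle 2-D images and richer transformation sets:

```python
import numpy as np

def shift_posterior_fft(x, mu, sigma2=1.0):
    """Posterior over all N circular shifts of the mean mu, in O(N log N).

    The log-likelihood of shift s is -||x - roll(mu, s)||^2 / (2 sigma2);
    the only shift-dependent term, sum_i x[i] * mu[i - s], is a circular
    correlation and can be computed for all s at once with the FFT.
    """
    cross = np.real(np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(mu))))
    loglik = (-0.5 / sigma2) * (np.dot(x, x) + np.dot(mu, mu) - 2.0 * cross)
    loglik -= loglik.max()          # stabilize the softmax
    p = np.exp(loglik)
    return p / p.sum()
```

For images, the same identity applies per 2-D shift via the 2-D FFT, which is the source of the speedup over enumerating transformations.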
Very loopy belief propagation for unwrapping phase images
Frey, Brendan J., Koetter, Ralf, Petrovic, Nemanja
Since the discovery that the best error-correcting decoding algorithm can be viewed as belief propagation in a graph with cycles, researchers have been trying to determine under what circumstances "loopy belief propagation" is effective for probabilistic inference. Despite several theoretical advances in our understanding of loopy belief propagation, to our knowledge, the only problem that has been solved using loopy belief propagation is error-correcting decoding on Gaussian channels. We propose a new representation for the two-dimensional phase unwrapping problem, and we show that loopy belief propagation produces results that are superior to existing techniques. This is an important result, since many imaging techniques, including magnetic resonance imaging and interferometric synthetic aperture radar, produce phase-wrapped images. Interestingly, the graph that we use has a very large number of very short cycles, adding to the evidence that a large minimum cycle length is not needed for excellent results using belief propagation.
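For readers unfamiliar with the problem, here is a tiny 1-D illustration of phase wrapping, using numpy's sequential unwrapper — which suffices in one dimension; the loopy belief propagation approach above targets the much harder 2-D case, where sequential integration of phase differences breaks down:

```python
import numpy as np

# A smooth phase ramp spanning several cycles...
true_phase = np.linspace(0.0, 20.0, 200)

# ...is only observed modulo 2*pi, wrapped into [-pi, pi).
wrapped = (true_phase + np.pi) % (2.0 * np.pi) - np.pi

# In 1-D, integrating the jump corrections recovers the original exactly,
# provided neighboring samples differ by less than pi.
recovered = np.unwrap(wrapped)
```

In 2-D the jump corrections must be consistent around every loop of pixels, which is what makes the problem a natural fit for inference in a graph with many short cycles.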
ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
Frey, Brendan J., Kristjansson, Trausti T., Deng, Li, Acero, Alex
A challenging, unsolved problem in the speech recognition community is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recognition is to automatically remove the noise from the cepstrum sequence before feeding it into a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise-free speech from the noisy speech. We showed how an iterative second-order vector Taylor series approximation could be used for probabilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech.
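The second-order Taylor series idea can be sketched on the standard log-spectrum mismatch function, where noisy speech y relates to clean speech x and noise n (in the log power domain) via y = log(e^x + e^n). The function names below are hypothetical helpers for illustration, not the paper's code:

```python
import numpy as np

def mismatch(x, n):
    """Log power spectrum of speech x plus noise n: log(e^x + e^n)."""
    return np.logaddexp(x, n)

def mismatch_taylor2(x, n, x0, n0):
    """Second-order Taylor expansion of the mismatch around (x0, n0).

    Gradient is (s, 1 - s) with s = 1 / (1 + e^(n0 - x0));
    Hessian is s(1 - s) * [[1, -1], [-1, 1]].
    """
    s = 1.0 / (1.0 + np.exp(n0 - x0))
    dx, dn = x - x0, n - n0
    h = s * (1.0 - s)
    return (mismatch(x0, n0) + s * dx + (1.0 - s) * dn
            + 0.5 * h * (dx * dx - 2.0 * dx * dn + dn * dn))
```

Linearizing the mismatch in this way is what makes Gaussian inference over clean speech and noise tractable, and iterating re-centers the expansion point at the current posterior estimate.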
Product Analysis: Learning to Model Observations as Products of Hidden Variables
Frey, Brendan J., Kannan, Anitha, Jojic, Nebojsa
Factor analysis and principal components analysis can be used to model linear relationships between observed variables and linearly map high-dimensional data to a lower-dimensional hidden space. In factor analysis, the observations are modeled as a linear combination of normally distributed hidden variables. We describe a nonlinear generalization of factor analysis, called "product analysis", that models the observed variables as a linear combination of products of normally distributed hidden variables. Just as factor analysis can be viewed as unsupervised linear regression on unobserved, normally distributed hidden variables, product analysis can be viewed as unsupervised linear regression on products of unobserved, normally distributed hidden variables. The mapping between the data and the hidden space is nonlinear, so we use an approximate variational technique for inference and learning.
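A minimal sketch of sampling from one possible parameterization of such a product model — two independent Gaussian factors per hidden product, which is an assumption for illustration rather than the paper's exact factorization:

```python
import numpy as np

def sample_product_analysis(W, n_samples, noise_std=0.1, seed=0):
    """Draw observations x = W @ (z1 * z2) + eps, with z1, z2, eps Gaussian.

    Unlike factor analysis (x = W @ z + eps), the elementwise product
    z1 * z2 makes the map from hidden variables to data nonlinear, and
    the marginal distribution of x non-Gaussian (heavy-tailed).
    """
    rng = np.random.default_rng(seed)
    d_hidden = W.shape[1]
    z1 = rng.normal(size=(n_samples, d_hidden))
    z2 = rng.normal(size=(n_samples, d_hidden))
    eps = noise_std * rng.normal(size=(n_samples, W.shape[0]))
    return (z1 * z2) @ W.T + eps
```

The heavy tails of the product of two Gaussians (excess kurtosis 6 in the standard case) are one way to see why the posterior is intractable and a variational approximation is needed.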