Clustering sequence sets for motif discovery

Neural Information Processing Systems

Most existing methods for DNA motif discovery consider only a single set of sequences to find an over-represented motif. In contrast, we consider multiple sets of sequences where we group sets associated with the same motif into a cluster, assuming that each set involves a single motif. Clustering sets of sequences yields clusters of coherent motifs, improving the signal-to-noise ratio or enabling us to identify multiple motifs. We present a probabilistic model for DNA motif discovery where we identify multiple motifs by searching for patterns that are shared across multiple sets of sequences. Our model infers cluster-indicating latent variables and learns motifs simultaneously, where these two tasks interact with each other. We show that our model can handle various motif discovery problems, depending on how the multiple sets of sequences are constructed. Experiments on three different problems for discovering DNA motifs emphasize the useful behavior and confirm the substantial gains over existing methods where only a single set of sequences is considered.


Discovering Multivariate Motifs using Subsequence Density Estimation and Greedy Mixture Learning

AAAI Conferences

The problem of locating motifs in real-valued, multivariate time series data involves the discovery of sets of recurring patterns embedded in the time series. Each set is composed of several non-overlapping subsequences and constitutes a motif because all of the included subsequences are similar. The ability to automatically discover such motifs allows intelligent systems to form endogenously meaningful representations of their environment through unsupervised sensor analysis. In this paper, we formulate a unifying view of motif discovery as a problem of locating regions of high density in the space of all time series subsequences. Our approach is efficient (sub-quadratic in the length of the data), requires fewer user-specified parameters than previous methods, and naturally allows variable length motif occurrences and nonlinear temporal warping. We evaluate the performance of our approach using four data sets from different domains including on-body inertial sensors and speech.
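The density view described in this abstract can be illustrated with a deliberately naive sketch. It assumes fixed-length windows, Euclidean distance, and a hypothetical `radius` parameter, and it compares all pairs of windows, so it is quadratic in the length of the data. It is not the paper's sub-quadratic algorithm and does not handle variable-length occurrences or temporal warping. The function name `densest_subsequence` is illustrative only.

```python
from math import dist

def densest_subsequence(series, width, radius):
    """Return (start, count) for the window with the most non-overlapping
    neighbors within `radius`. `series` is a list of equal-length tuples,
    one tuple per time step (a multivariate sample).
    Returns (-1, -1) if the series is shorter than `width`."""
    # Flatten each window of `width` samples into one point in subsequence space.
    windows = [
        [value for sample in series[s:s + width] for value in sample]
        for s in range(len(series) - width + 1)
    ]
    best_start, best_count = -1, -1
    for a, wa in enumerate(windows):
        # Count neighbors, skipping windows that overlap window `a`
        # (overlapping windows are trivially similar).
        count = sum(
            1
            for b, wb in enumerate(windows)
            if abs(a - b) >= width and dist(wa, wb) <= radius
        )
        if count > best_count:
            best_start, best_count = a, count
    return best_start, best_count
```

A window with a high neighbor count sits in a dense region of subsequence space, which is the sense in which it is a motif candidate here. The neighbor count is only a crude stand-in for a proper density estimate.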


Enumerating and Ranking Discrete Motifs

AAAI Conferences

Sequence motifs allow functional inferences to be made on the basis of homology, and provide clues to important structural constraints. In the past, motifs have been found by a hit-or-miss process of heuristically pruning the space of motifs. We have discovered that, surprisingly, the motifs can usually be enumerated exhaustively. This paper describes the development of EMOTIF, a system that is capable of enumerating the entire space of motifs from a sequence alignment and choosing the motif that maximizes a scoring function based on both statistics and information theory.


Fitting a mixture model by expectation maximization to discover motifs in biopolymers

AAAI Conferences

The motif model used by MM says that each position in a subsequence which is an occurrence of the motif is generated by an independent random variable describing a multinomial trial with parameter $f_i = (f_{i1}, \ldots, f_{iL})$. That is, the probability of letter $a_j$ appearing in position $i$ in the motif is $f_{ij}$.
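Because the positions are independent, the probability of a whole subsequence under this motif model is the product of the per-position letter probabilities. The following is a minimal sketch of that calculation. The DNA alphabet, the width-3 frequency matrix `MOTIF`, and the function name `motif_probability` are assumptions made for illustration and do not come from the source.

```python
from math import prod

ALPHABET = "ACGT"  # assumed alphabet; the source only speaks of letters a_j

# Hypothetical frequency matrix for a motif of width 3.
# Row i holds the multinomial parameter for position i, one entry per letter.
MOTIF = [
    [0.7, 0.1, 0.1, 0.1],  # position 1 favors A
    [0.1, 0.1, 0.7, 0.1],  # position 2 favors G
    [0.1, 0.7, 0.1, 0.1],  # position 3 favors C
]

def motif_probability(subseq, motif=MOTIF, alphabet=ALPHABET):
    """Probability of `subseq` when each position is drawn independently
    from that position's multinomial distribution."""
    if len(subseq) != len(motif):
        raise ValueError("subsequence length must equal motif width")
    # Multiply the probability of the observed letter at each position.
    return prod(row[alphabet.index(letter)] for row, letter in zip(motif, subseq))
```

For example, `motif_probability("AGC")` multiplies 0.7 three times, giving 0.343 up to floating-point rounding, while a subsequence that matches at no position, such as `"TTT"`, gets 0.1 three times.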


The value of prior knowledge in discovering motifs with MEME

AAAI Conferences

The E-step of EM calculates the expected value of the missing information: the probability that a motif occurrence starts in position $j$ of sequence $X_i$. The formulas used by MEME for the three types of model are given below. Derivations are given elsewhere (Bailey & Elkan 1995b).
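As a rough illustration of what such an E-step computes, the sketch below assumes the simplest case: exactly one motif occurrence per sequence, a uniform prior over start positions, a position-independent background distribution, and strictly positive probabilities. These assumptions, the function name `e_step`, and the 0-based offsets are illustrative choices; this is not a reproduction of MEME's formulas for its three model types.

```python
from math import prod

ALPHABET = "ACGT"  # assumed alphabet

def e_step(sequences, motif, background, alphabet=ALPHABET):
    """Return z, where z[i][j] is the posterior probability that the motif
    occurrence in sequence i starts at 0-based offset j.
    `motif` is a list of per-position letter distributions and
    `background` is a single letter distribution."""
    width = len(motif)
    index = {letter: k for k, letter in enumerate(alphabet)}
    posteriors = []
    for seq in sequences:
        if len(seq) < width:
            raise ValueError("sequence shorter than motif width")
        # Likelihood ratio of motif vs. background for each candidate window.
        # Background terms outside the window are identical for every start
        # position, so they cancel when the scores are normalized.
        scores = [
            prod(
                motif[p][index[seq[j + p]]] / background[index[seq[j + p]]]
                for p in range(width)
            )
            for j in range(len(seq) - width + 1)
        ]
        total = sum(scores)
        posteriors.append([s / total for s in scores])
    return posteriors
```

Each row of the result sums to 1, and the M-step of EM would then use these posteriors as weights when re-estimating the motif's letter frequencies.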