An Exact Algorithm to Identify Motifs in Orthologous Sequences from Multiple Species

AAAI Conferences

Mathieu Blanchette, Benno Schwikowski, and Martin Tompa Department of Computer Science and Engineering Box 352350 University of Washington Seattle, WA 98195-2350 U.S.A. 206-543-9263 fax: 206-543-8331 {blanchem, benno, tompa} cs, washington, edu Abstract The identification of sequence motifs is a fundamental method for suggesting good candidates for biologically functional regions such as promoters, splice sites, binding sites, etc. We investigate the following approach to identifying motifs: given a collection of orthologous sequences from multiple species related by a known phylogenetic tree, search for motifs that axe well conserved (according to a parsimony measure) in the species. We present an exact algorithm for solving this problem. We then discuss experimental results on finding promoters of the rbcS gene for a family of 10 plants, on finding promoters of the adh gene for 12 Drosophila species, and on finding promoters of several chloroplast encoded genes. Using motifs to find functional regions One of the fundamental techniques in biological sequence analysis is the identification of sequence motifs as a means of suggesting good candidates for biologically functional regions such as promoters, splice sites, binding sites, etc. For short repeated sites, the common way this is done is to assemble a collection of sequences from a single genome believed to contain the site, and search in these sequences for a pattern that occurs in a statistically significant overabundance. To find promoters, for example, one would identify genes believed to be coregulated, and then look for patterns in their regulatory regions. From a computational point of view, the methods used to find the patterns fall into four main categories.

Efficient Haplotype Inference with Boolean Satisfiability

AAAI Conferences

One of the main topics of research in genomics is determining the relevance of mutations, described in haplotype data, as causes of some genetic diseases. However, due to technological limitations, genotype data rather than haplotype data is usually obtained. The haplotype inference by pure parsimony (HIPP) problem consists in inferring haplotypes from genotypes s.t. the number of required haplotypes is minimum. Previous approaches to the HIPP problem have focused on integer programming models and branch-and-bound algorithms. In contrast, this paper proposes the utilization of Boolean Satisfiability (SAT). The proposed solution entails a SAT model, a number of key pruning techniques, and an iterative algorithm that enumerates the possible solution values for the target optimization problem. Experimental results, obtained on a wide range of instances, demonstrate that the SATbased approach can be several orders of magnitude faster than existing solutions. Besides being more efficient, the SATbased approach is also the only capable of computing the solution for a large number of instances.

Clustering Documents Along Multiple Dimensions

AAAI Conferences

Traditional clustering algorithms are designed to search for a single clustering solution despite the fact that multiple alternative solutions might exist for a particular dataset. For example, a set of news articles might be clustered by topic or by the author's gender or age. Similarly, book reviews might be clustered by sentiment or comprehensiveness. In this paper, we address the problem of identifying alternative clustering solutions by developing a Probabilistic Multi-Clustering (PMC) model that discovers multiple, maximally different clusterings of a data sample. Empirical results on six datasets representative of real-world applications show that our PMC model exhibits superior performance to comparable multi-clustering algorithms.

A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data

AAAI Conferences

The breakpoint phylogeny is an optimization problem proposed by Blanchette et al. for reconstructing evolutionary trees from gene order data. These same authors also developed and implemented BPAnalysis [3], a heuristic method (based upon solving many instances of the travelling salesman problem) for estimating the breakpoint phylogeny. We present a new heuristic for this purpose; although not polynomial-time, our heuristic is much faster in practice than BPAnalysis. We present and discuss the results of experimentation on synthetic datasets and on the flowering plant family Campanulaceae with three methods: our new method, BPAnalysis, and the neighbor-joining method [25] using several distance estimation techniques. Our preliminary results indicate that, on datasets with slow evolutionary rates and large numbers of genes in comparison with the number of taxa (genomes), all methods recover quite accurate reconstructions of the true evolutionary history (although BPAnalysis is too slow to be practical), but that on datasets where the rate of evolution is high relative to the number of genes, the accuracy of all three methods is poor.

Clustering with phylogenetic tools in astrophysics Machine Learning

Phylogenetic approaches are finding more and more applications outside the field of biology. Astrophysics is no exception since an overwhelming amount of multivariate data has appeared in the last twenty years or so. In particular, the diversification of galaxies throughout the evolution of the Universe quite naturally invokes phylogenetic approaches. We have demonstrated that Maximum Parsimony brings useful astrophysical results, and we now proceed toward the analyses of large datasets for galaxies. In this talk I present how we solve the major difficulties for this goal: the choice of the parameters, their discretization, and the analysis of a high number of objects with an unsupervised NP-hard classification technique like cladistics. 1. Introduction How do the galaxy form, and when? How did the galaxy evolve and transform themselves to create the diversity we observe? What are the progenitors to present-day galaxies? To answer these big questions, observations throughout the Universe and the physical modelisation are obvious tools. But between these, there is a key process, without which it would be impossible to extract some digestible information from the complexity of these systems. This is classification. One century ago, galaxies were discovered by Hubble. From images obtained in the visible range of wavelengths, he synthetised his observations through the usual process: classification. With only one parameter (the shape) that is qualitative and determined with the eye, he found four categories: ellipticals, spirals, barred spirals and irregulars. This is the famous Hubble classification. He later hypothetized relationships between these classes, building the Hubble Tuning Fork. The Hubble classification has been refined, notably by de Vaucouleurs, and is still used as the only global classification of galaxies. Even though the physical relationships proposed by Hubble are not retained any more, the Hubble Tuning Fork is nearly always used to represent the classification of the galaxy diversity under its new name the Hubble sequence (e.g. Delgado-Serrano, 2012). Its success is impressive and can be understood by its simplicity, even its beauty, and by the many correlations found between the morphology of galaxies and their other properties. And one must admit that there is no alternative up to now, even though both the Hubble classification and diagram have been recognised to be unsatisfactory. Among the most obvious flaws of this classification, one must mention its monovariate, qualitative, subjective and old-fashioned nature, as well as the difficulty to characterise the morphology of distant galaxies. The first two most significant multivariate studies were by Watanabe et al. (1985) and Whitmore (1984). Since the year 2005, the number of studies attempting to go beyond the Hubble classification has increased largely. Why, despite of this, the Hubble classification and its sequence are still alive and no alternative have yet emerged (Sandage, 2005)? My feeling is that the results of the multivariate analyses are not easily integrated into a one-century old practice of modeling the observations. In addition, extragalactic objects like galaxies, stellar clusters or stars do evolve. Astronomy now provides data on very distant objects, raising the question of the relationships between those and our present day nearby galaxies. Clearly, this is a phylogenetic problem. Astrocladistics 1 aims at exploring the use of phylogenetic tools in astrophysics (Fraix-Burnet et al., 2006a,b). We have proved that Maximum Parsimony (or cladistics) can be applied in astrophysics and provides a new exploration tool of the data (Fraix-Burnet et al., 2009, 2012, Cardone \& Fraix-Burnet, 2013). As far as the classification of galaxies is concerned, a larger number of objects must now be analysed. In this paper, I