Solving Large Scale Phylogenetic Problems using DCM2

AAAI Conferences

Tandy J. Warnow, Department of Computer Science, University of Arizona, Tucson, AZ, USA. Email: tandy@cs.arizona.edu

Abstract

In an earlier paper, we described a new method for phylogenetic tree reconstruction called the Disk Covering Method, or DCM. This is a general method which can be used with any existing phylogenetic method in order to improve its performance. We showed analytically and experimentally that when DCM is used in conjunction with polynomial time distance-based methods, it improves the accuracy of the trees reconstructed. In this paper, we discuss a variant of DCM that we call DCM2. DCM2 is designed to be used with phylogenetic methods whose objective is the solution of NP-hard optimization problems. We also motivate the need for solutions to NP-hard optimization problems by showing that on some very large and important datasets, the most popular (and presumably best performing) polynomial time distance methods have poor accuracy.

Introduction

The accurate recovery of the phylogenetic branching order from molecular sequence data is fundamental to many problems in biology. Multiple sequence alignment, gene function prediction, protein structure, and drug design all depend on phylogenetic inference. Although many methods exist for the inference of phylogenetic trees, biologists who specialize in systematics typically compute Maximum Parsimony (MP) or Maximum Likelihood (ML) trees because they are thought to be the best predictors of accurate branching order. Unfortunately, the MP and ML optimization problems are NP-hard, and typical heuristics use hill-climbing techniques to search through an exponentially large space. When large numbers of taxa are involved, the computational cost of MP and ML methods is so great that it may take years of computation for a local minimum to be obtained on a single dataset (Chase et al. 1993; Rice, Donoghue, & Olmstead 1997).
It is because of this computational cost that many biologists resort to distance-based calculations, such as Neighbor-Joining (NJ) (Saitou & Nei 1987), even though these may have poor accuracy when the diameter of the tree is large (Huson et al. 1998). As DNA sequencing methods advance, large, divergent biological datasets are becoming commonplace. For example, the February 1999 issue of Molecular Biology and Evolution contained five distinct datasets of more than 50 taxa, and two others that had been pruned below that.

Discriminative Feature Selection for Uncertain Graph Classification

Machine Learning

Mining discriminative features for graph data has attracted much attention in recent years due to its important role in constructing graph classifiers, generating graph indices, etc. Most measures of interestingness for discriminative subgraph features are defined on certain graphs, where the structure of each graph object is certain and the binary edges within each graph represent the "presence" of linkages among the nodes. In many real-world applications, however, the linkage structure of the graphs is inherently uncertain. Therefore, existing measures of interestingness based upon certain graphs are unable to capture the structural uncertainty in these applications effectively. In this paper, we study the problem of discriminative subgraph feature selection from uncertain graphs. This problem is challenging and different from conventional subgraph mining problems because both the structure of the graph objects and the discrimination score of each subgraph feature are uncertain. To address these challenges, we propose a novel discriminative subgraph feature selection method, DUG, which can find discriminative subgraph features in uncertain graphs based upon different statistical measures including expectation, median, mode and phi-probability. We first compute the probability distribution of the discrimination scores for each subgraph feature based on dynamic programming. Then a branch-and-bound algorithm is proposed to search for discriminative subgraphs efficiently. Extensive experiments on various neuroimaging applications (i.e., Alzheimer's Disease, ADHD and HIV) have been performed to analyze the gain in performance by taking into account structural uncertainties in identifying discriminative subgraph features for graph classification.
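The dynamic-programming step mentioned above can be illustrated with a minimal sketch. It is not the DUG algorithm itself. It assumes a simplified setting in which each uncertain graph contains a given subgraph feature independently with a known probability, and it computes the distribution of the number of graphs containing that feature (a Poisson binomial distribution). The function names and the reading of phi-probability as a tail probability are assumptions made for illustration.

```python
from typing import List


def support_distribution(probs: List[float]) -> List[float]:
    """Return dist where dist[k] = P(exactly k graphs contain the feature).

    Assumption: probs[i] is the probability that graph i contains the
    feature, and graphs are independent of one another.
    """
    dist = [1.0]  # with zero graphs processed, the count is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # feature absent from this graph
            nxt[k + 1] += mass * p      # feature present in this graph
        dist = nxt
    return dist


def expectation(dist: List[float]) -> float:
    return sum(k * mass for k, mass in enumerate(dist))


def mode(dist: List[float]) -> int:
    return max(range(len(dist)), key=lambda k: dist[k])


def median(dist: List[float]) -> int:
    """Smallest k whose cumulative probability reaches 0.5."""
    cumulative = 0.0
    for k, mass in enumerate(dist):
        cumulative += mass
        if cumulative >= 0.5:
            return k
    return len(dist) - 1


def tail_probability(dist: List[float], threshold: int) -> float:
    """P(count >= threshold); a hypothetical stand-in for phi-probability."""
    return sum(dist[threshold:])


if __name__ == "__main__":
    d = support_distribution([0.9, 0.4, 0.7])
    print(d, expectation(d), median(d), mode(d), tail_probability(d, 2))
```

The table is built in O(n^2) time for n graphs, which avoids enumerating all 2^n possible worlds. A real discrimination score would combine counts from both classes, so this shows only the core recurrence.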

A brain signature highly predictive of future progression to Alzheimer's dementia

Machine Learning

Early prognosis of Alzheimer's dementia is hard. Mild cognitive impairment (MCI) typically precedes Alzheimer's dementia, yet only a fraction of MCI individuals will progress to dementia, even when screened using biomarkers. We propose here to identify a subset of individuals who share a common brain signature highly predictive of oncoming dementia. This signature was composed of brain atrophy and functional dysconnectivity and was discovered using a machine learning model in patients suffering from dementia. The model recognized the same brain signature in MCI individuals, 90% of whom progressed to dementia within three years. This result is a marked improvement on the state of the art in prognostic precision, while the brain signature still identified 47% of all MCI progressors. We thus discovered a sizable MCI subpopulation which represents an excellent recruitment target for clinical trials at the prodromal stage of Alzheimer's disease.

A rational decision making framework for inhibitory control

Neural Information Processing Systems

Intelligent agents are often faced with the need to choose actions with uncertain consequences, and to modify those actions according to ongoing sensory processing and changing task demands. The requisite ability to dynamically modify or cancel planned actions is known as inhibitory control in psychology. We formalize inhibitory control as a rational decision-making problem, and apply it to the classical stop-signal task. Using Bayesian inference and stochastic control tools, we show that the optimal policy systematically depends on various parameters of the problem, such as the relative costs of different action choices, the noise level of sensory inputs, and the dynamics of changing environmental demands. Our normative model accounts for a range of behavioral data in humans and animals in the stop-signal task, suggesting that the brain implements statistically optimal, dynamically adaptive, and reward-sensitive decision-making in the context of inhibitory control problems.

Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models

Machine Learning

We present a machine learning framework for modeling protein dynamics. Our approach uses L1-regularized, reversible hidden Markov models to understand large protein datasets generated via molecular dynamics simulations. Our model is motivated by three design principles: (1) the requirement of massive scalability; (2) the need to adhere to relevant physical laws; and (3) the necessity of providing accessible interpretations, critical for both cellular biology and rational drug design. We present an EM algorithm for learning and introduce a model selection criterion based on the physical notion of convergence in relaxation timescales. We contrast our model with standard methods in biophysics and demonstrate improved robustness. We implement our algorithm on GPUs and apply the method to two large protein simulation datasets generated respectively on the NCSA Bluewaters supercomputer and the Folding@Home distributed computing network. Our analysis identifies the conformational dynamics of the ubiquitin protein critical to cellular signaling, and elucidates the stepwise activation mechanism of the c-Src kinase protein.
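Two notions in this abstract, reversibility and relaxation timescales, can be made concrete with a small sketch. It is not the authors' implementation. It assumes a row-stochastic transition matrix T given as nested lists, checks the detailed balance condition pi_i T_ij = pi_j T_ji that defines a reversible chain, and converts a transition-matrix eigenvalue into a relaxation timescale with the standard relation t = -lag / ln(lambda). All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def stationary_distribution(T: Matrix, iters: int = 1000) -> List[float]:
    """Approximate the stationary distribution by power iteration.

    Assumption: T is row-stochastic and the chain is ergodic, so the
    iteration converges from a uniform start.
    """
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
    return pi


def is_reversible(T: Matrix, tol: float = 1e-9) -> bool:
    """Check detailed balance: pi[i] * T[i][j] == pi[j] * T[j][i] for all i, j."""
    pi = stationary_distribution(T)
    n = len(T)
    return all(
        abs(pi[i] * T[i][j] - pi[j] * T[j][i]) <= tol
        for i in range(n)
        for j in range(n)
    )


def relaxation_timescale(eigenvalue: float, lag: float) -> float:
    """Timescale implied by a transition-matrix eigenvalue in (0, 1)."""
    return -lag / math.log(eigenvalue)


if __name__ == "__main__":
    two_state = [[0.9, 0.1], [0.2, 0.8]]
    print(is_reversible(two_state))
    # For a two-state chain the second eigenvalue is 1 - 0.1 - 0.2 = 0.7.
    print(relaxation_timescale(0.7, lag=1.0))
```

Every ergodic two-state chain satisfies detailed balance, whereas a three-state chain with a preferred direction of circulation does not. This is why reversibility has to be enforced as a constraint when fitting larger models.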