Solving Large Scale Phylogenetic Problems using DCM2

AAAI Conferences

Tandy J. Warnow, Department of Computer Science, University of Arizona, Tucson, AZ, USA; email: tandy@cs.arizona.edu
Abstract
In an earlier paper, we described a new method for phylogenetic tree reconstruction called the Disk Covering Method, or DCM. This is a general method which can be used with any existing phylogenetic method in order to improve its performance. We showed analytically and experimentally that when DCM is used in conjunction with polynomial time distance-based methods, it improves the accuracy of the trees reconstructed. In this paper, we discuss a variant of DCM that we call DCM2. DCM2 is designed to be used with phylogenetic methods whose objective is the solution of NP-hard optimization problems. We also motivate the need for solutions to NP-hard optimization problems by showing that on some very large and important datasets, the most popular (and presumably best performing) polynomial time distance methods have poor accuracy.
Introduction
The accurate recovery of the phylogenetic branching order from molecular sequence data is fundamental to many problems in biology. Multiple sequence alignment, gene function prediction, protein structure, and drug design all depend on phylogenetic inference. Although many methods exist for the inference of phylogenetic trees, biologists who specialize in systematics typically compute Maximum Parsimony (MP) or Maximum Likelihood (ML) trees because they are thought to be the best predictors of accurate branching order. Unfortunately, the MP and ML optimization problems are NP-hard, and typical heuristics use hill-climbing techniques to search through an exponentially large space. When large numbers of taxa are involved, the computational cost of MP and ML methods is so great that it may take years of computation for a local minimum to be obtained on a single dataset (Chase et al. 1993; Rice, Donoghue, & Olmstead 1997).
It is because of this computational cost that many biologists resort to distance-based calculations, such as Neighbor-Joining (NJ) (Saitou & Nei 1987), even though these may have poor accuracy when the diameter of the tree is large (Huson et al. 1998). As DNA sequencing methods advance, large, divergent biological datasets are becoming commonplace. For example, the February 1999 issue of Molecular Biology and Evolution contained five distinct datasets of more than 50 taxa, and two others that had been pruned below that.


Click click snap: One look at patient's face, and AI can identify rare genetic diseases

#artificialintelligence

WASHINGTON D.C. [USA]: According to a recent study, a new artificial intelligence technology can accurately identify rare genetic disorders using a photograph of a patient's face. Named DeepGestalt, the AI technology outperformed clinicians in identifying a range of syndromes in three trials and could add value in personalised care, CNN reported. The study was published in the journal Nature Medicine. According to the study, eight per cent of the population has a disease with key genetic components, and many may have recognisable facial features. The study further adds that the technology could identify, for example, Angelman syndrome, a disorder affecting the nervous system with characteristic features such as a wide mouth with widely spaced teeth. Speaking about it, Yaron Gurovich, the chief technology officer at FDNA and lead researcher of the study, said, "It demonstrates how one can successfully apply state of the art algorithms, such as deep learning, to a challenging field where the available data is small, unbalanced in terms of available patients per condition, and where the need to support a large amount of conditions is great."


Targeted Learning with Daily EHR Data

arXiv.org Machine Learning

Electronic health records (EHR) data provide a cost- and time-effective opportunity to conduct cohort studies of the effects of multiple time-point interventions in the diverse patient population found in real-world clinical settings. Because the computational cost of analyzing EHR data at daily (or more granular) scale can be quite high, a pragmatic approach has been to partition the follow-up into coarser intervals of pre-specified length. Current guidelines suggest employing a 'small' interval, but the feasibility and practical impact of this recommendation have not been evaluated, and no formal methodology to inform this choice has been developed. We start filling these gaps by leveraging large-scale EHR data from a diabetes study to develop and illustrate a fast and scalable targeted learning approach that makes it possible to follow the current recommendation and study its practical impact on inference. More specifically, we map daily EHR data into four analytic datasets using 90-, 30-, 15- and 5-day intervals. We apply a semi-parametric and doubly robust estimation approach, the longitudinal TMLE, to estimate the causal effects of four dynamic treatment rules with each dataset, and compare the resulting inferences. To overcome the computational challenges presented by the size of these data, we propose a novel TMLE implementation, the 'long-format TMLE', and rely on the latest advances in scalable data-adaptive machine-learning software, xgboost and h2o, for estimation of the TMLE nuisance parameters.
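
The mapping of daily records into coarser analytic intervals can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the record fields (`day`, `treated`, `lab_value`), the function name `coarsen`, and the summary rules (exposure on any day in the interval, last observed lab value).

```python
def coarsen(daily, interval_days):
    """Collapse daily records into consecutive intervals of interval_days days.

    daily: list of dicts with hypothetical keys
        'day'       -- 0-based day of follow-up (int)
        'treated'   -- exposure recorded on that day (bool)
        'lab_value' -- measurement on that day, or None if not measured
    Returns one row per non-empty interval, ordered by interval index.
    """
    rows = {}
    for rec in sorted(daily, key=lambda r: r["day"]):
        k = rec["day"] // interval_days  # index of the interval containing this day
        row = rows.setdefault(
            k, {"interval": k, "treated": False, "lab_value": None}
        )
        # Illustrative rule: treated if exposed on any day in the interval.
        row["treated"] = row["treated"] or rec["treated"]
        # Illustrative rule: keep the last observed measurement in the interval.
        if rec["lab_value"] is not None:
            row["lab_value"] = rec["lab_value"]
    return [rows[k] for k in sorted(rows)]


# Made-up daily records for a single patient.
daily = [
    {"day": 0, "treated": False, "lab_value": 7.1},
    {"day": 3, "treated": True, "lab_value": None},
    {"day": 7, "treated": False, "lab_value": 6.8},
    {"day": 40, "treated": False, "lab_value": None},
]

# Build one analytic dataset per interval length named in the abstract.
datasets = {length: coarsen(daily, length) for length in (90, 30, 15, 5)}
for length, rows in datasets.items():
    print(length, rows)
```

Shorter intervals preserve more of the timing of exposure and measurement but produce more rows per patient, which is the computational trade-off the abstract describes.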


Deep Learning: Not Just for Silicon Valley · fast.ai

#artificialintelligence

Recent American news events range from horrifying to dystopian, but reading the applications of our fast.ai fellows, I was blown away by how many bright, creative, resourceful folks from all over the world are applying deep learning to tackle a variety of meaningful and interesting problems. Their passions include ending illegal logging, diagnosing malaria in rural Uganda, translating Japanese manga, reducing farmer suicides in India via better loans, making Nigerian fashion recommendations, monitoring patients with Parkinson's disease, and more. Our mission at fast.ai is to make deep learning accessible to people from varied backgrounds outside of elite institutions, who are tackling problems in meaningful but low-resource areas, far from mainstream deep learning research. Our group of selected fellows for Deep Learning Part 2 includes people from Nigeria, Ivory Coast, South Africa, Pakistan, Bangladesh, India, Singapore, Israel, Canada, Spain, Germany, France, Poland, Russia, and Turkey.


A brain signature highly predictive of future progression to Alzheimer's dementia

arXiv.org Machine Learning

Early prognosis of Alzheimer's dementia is hard. Mild cognitive impairment (MCI) typically precedes Alzheimer's dementia, yet only a fraction of MCI individuals will progress to dementia, even when screened using biomarkers. We propose here to identify a subset of individuals who share a common brain signature highly predictive of oncoming dementia. This signature was composed of brain atrophy and functional dysconnectivity and was discovered using a machine learning model in patients suffering from dementia. The model recognized the same brain signature in MCI individuals, 90% of whom progressed to dementia within three years. This result is a marked improvement on the state of the art in prognostic precision, while the brain signature still identified 47% of all MCI progressors. We thus discovered a sizable MCI subpopulation which represents an excellent recruitment target for clinical trials at the prodromal stage of Alzheimer's disease.
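
The two figures in this abstract correspond to precision (the share of flagged individuals who progress) and recall (the share of all progressors who are flagged). The sketch below shows the arithmetic only. The counts are made up to roughly reproduce the reported percentages and are not the study's data; the function name is likewise hypothetical.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = true_pos / (true_pos + false_pos)  # flagged who progressed
    recall = true_pos / (true_pos + false_neg)     # progressors who were flagged
    return precision, recall


# Hypothetical counts, chosen only to illustrate the two ratios.
flagged_progressors = 45     # carry the signature and progressed
flagged_nonprogressors = 5   # carry the signature but did not progress
missed_progressors = 50      # progressed without carrying the signature

precision, recall = precision_recall(
    flagged_progressors, flagged_nonprogressors, missed_progressors
)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

A high-precision, moderate-recall signature suits trial recruitment: most of those selected will progress, even though about half of all eventual progressors are not selected.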