Phylo2Vec: a vector representation for binary trees
Penn, Matthew J, Scheidwasser, Neil, Khurana, Mark P, Duchêne, David A, Donnelly, Christl A, Bhatt, Samir
–arXiv.org Artificial Intelligence
Binary phylogenetic trees inferred from biological data are central to understanding the shared evolutionary history of organisms. Inferring the placement of latent nodes in a tree by any optimality criterion (e.g., maximum likelihood) is an NP-hard problem, propelling the development of myriad heuristic approaches. Yet, these heuristics often lack a systematic means of uniformly sampling random trees or effectively exploring a tree space that grows factorially, which are crucial to optimisation problems such as machine learning. Phylo2Vec maps any binary tree with n leaves to an integer vector of length n 1. We prove that Phylo2Vec is both well-defined and bijective to the space of phylogenetic trees. The advantages of Phylo2Vec are twofold: i) easy uniform sampling of binary trees and ii) systematic ability to traverse tree space in very large or small jumps. As a proof of concept, we use Phylo2Vec for maximum likelihood inference on five real-world datasets and show that a simple hill climbing-based optimisation can efficiently traverse the vastness of tree space from a random to an optimal tree. Phylogenetic trees are a fundamental tool in depicting evolutionary processes, whether linguistic (evolution of different languages and language families) or biological (evolution of biological entities). In the latter field, phylogenetic trees are integral to multiple research domains, including evolution (Morlon et al., 2010), conservation (Rolland et al., 2011), and epidemiology, where they allow us to better understand infectious disease transmission dynamics (Ypma et al., 2013; Faria et al., 2021). A multitude of computer-readable formats have been proposed to store and represent (binary) phylogenetic trees. While basic data structures such as arrays or linked lists can be used for this purpose, the Newick format, as outlined by Olsen (1990) and Felsenstein (2004), has emerged as the standard notation. Each parenthesis encloses a pair of leaf nodes or subtrees, separated by a comma.
arXiv.org Artificial Intelligence
Dec-1-2023
- Country:
- Europe
- Denmark > Capital Region
- Copenhagen (0.04)
- United Kingdom > England
- Greater London > London (0.04)
- Oxfordshire > Oxford (0.14)
- Tyne and Wear > Sunderland (0.04)
- Denmark > Capital Region
- North America > United States
- Massachusetts > Middlesex County > Reading (0.04)
- South America > Brazil
- Europe
- Genre:
- Research Report (0.82)
- Industry: