Collaborating Authors

 Allison, Lloyd


The divergence time of protein structures modelled by Markov matrices and its relation to the divergence of sequences

arXiv.org Artificial Intelligence

The evolutionary distance between two species is proportional to some (unknown) function of the time of divergence from their common ancestor. One way to estimate this time is by comparing the underlying macromolecular sequences, which cascade the information of accumulated evolutionary changes across DNA → RNA → proteins (sequence → structure → function). Since Zuckerkandl and Pauling (1965) introduced the molecular evolutionary clock for phylogenetic studies, several statistical models have been proposed to estimate the divergence of extant sequences from common ancestors, and to correlate those estimates with times from other sources of information (e.g., fossil records) when they exist (Sarich and Wilson, 1967). Such divergence-time estimates require reliable statistical models of DNA/RNA/protein macromolecules (Bromham and Penny, 2003). For protein amino acid sequences, several statistical models have been proposed to explain sequence variation as a function of time. The point accepted mutation (PAM) matrix of Dayhoff et al. (1978) was the first successful model of the mutability of amino acid sequences. PAM is a stochastic (Markov) matrix defined in PAM (time) units, where PAM-1 is the Markov matrix that embodies an expected 1% change to the amino acids. Subsequent studies highlighted the importance of incorporating evolutionary-time-dependent substitution and gap models as an elegant way to model the divergent relationships of proteins (Holmes, 1998; Gonnet et al., 1992). The recent approach of Sumanaweera et al. (2022) derives a unified statistical model for quantifying the evolution of pairs of protein sequences
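The PAM construction lends itself to a short sketch. The Python below, assuming a toy 3-letter alphabet with made-up frequencies in place of the 20 amino acids, illustrates the two defining properties: the expected change per unit time under a PAM-1-style matrix is 1%, and PAM-n is the n-th power of PAM-1, so observed change saturates rather than growing linearly with time.

    import numpy as np

    # Toy 3-letter alphabet with assumed frequencies; the real PAM model
    # works on the 20 amino acids with empirically fitted rates.
    pi = np.array([0.5, 0.3, 0.2])

    # A row-stochastic "PAM-1-like" matrix: each row sums to 1, and the
    # expected change per unit time, sum_i pi[i] * (1 - M[i, i]), is 1%.
    M1 = np.array([[0.990, 0.006, 0.004],
                   [0.007, 0.990, 0.003],
                   [0.005, 0.005, 0.990]])

    def expected_change(M, pi):
        """Expected fraction of positions that differ after applying M once."""
        return float(np.sum(pi * (1.0 - np.diag(M))))

    def pam(M1, n):
        """PAM-n: n compositions (matrix powers) of the unit-time matrix."""
        return np.linalg.matrix_power(M1, n)

    print(expected_change(M1, pi))            # 0.01, i.e. 1% expected change
    print(expected_change(pam(M1, 250), pi))  # far below 2.5: changes overlap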


Causal KL: Evaluating Causal Discovery

arXiv.org Machine Learning

The two most commonly used criteria for assessing causal model discovery with artificial data are edit distance and Kullback-Leibler divergence, measured from the true model to the learned model. Both of these metrics maximally reward the true model. However, we argue that they are both insufficiently discriminating in judging the relative merits of false models. Edit distance, for example, fails to distinguish between strong and weak probabilistic dependencies. KL divergence, on the other hand, rewards equally all statistically equivalent models, regardless of their different causal claims. We propose an augmented KL divergence, which we call Causal KL (CKL), which takes into account causal relationships which distinguish between observationally equivalent models. Results are presented for three variants of CKL, showing that Causal KL works well in practice.
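The KL shortcoming the abstract points to is easy to reproduce. The following is a minimal Python sketch (not the paper's implementation) over two binary variables: the true model A → B and the edge-reversed model B → A encode the same joint distribution, so KL from the truth is zero for both, even though only one makes the correct causal claim.

    import itertools
    import numpy as np

    # True model: A -> B over binary variables (parameters are illustrative).
    pA = {0: 0.7, 1: 0.3}
    pB_given_A = {0: {0: 0.9, 1: 0.1},
                  1: {0: 0.2, 1: 0.8}}

    def joint_true(a, b):
        return pA[a] * pB_given_A[a][b]

    # The observationally equivalent model B -> A, parameterised to encode
    # exactly the same joint distribution.
    pB = {b: sum(joint_true(a, b) for a in (0, 1)) for b in (0, 1)}
    pA_given_B = {b: {a: joint_true(a, b) / pB[b] for a in (0, 1)}
                  for b in (0, 1)}

    def joint_reversed(a, b):
        return pB[b] * pA_given_B[b][a]

    def kl(p, q):
        """KL divergence from p to q over the joint state space."""
        return sum(p(a, b) * np.log(p(a, b) / q(a, b))
                   for a, b in itertools.product((0, 1), repeat=2))

    # Zero either way: KL cannot separate the two causal stories.
    print(kl(joint_true, joint_reversed))   # ~0.0

CKL augments this score with a term that does distinguish between such observationally equivalent structures.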


Markov Blanket Discovery using Minimum Message Length

arXiv.org Machine Learning

Causal discovery automates the learning of causal Bayesian networks from data and has been an area of active interest since the field's beginning. With large data sets now sourced from the internet, interest in scaling up to very large data sets has grown. One approach is to parallelise the search using Markov Blanket (MB) discovery as a first step, followed by a process of combining the MBs into a global causal model. We develop and explore three new methods of MB discovery using Minimum Message Length (MML) and compare them empirically to the best existing methods, whether developed specifically for MB discovery or for feature selection. Our best MML method is consistently competitive and has some advantageous features.
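For concreteness, the structural object being discovered is simple to state: the Markov blanket of a node in a DAG is its parents, its children, and its children's other parents ("spouses"). Below is a minimal Python sketch of that definition only; the paper's MML scoring of candidate blankets is not reproduced here.

    def markov_blanket(dag, x):
        """Markov blanket of x in a DAG; dag maps each node to its parent set."""
        parents = set(dag[x])
        children = {v for v, ps in dag.items() if x in ps}
        spouses = {p for c in children for p in dag[c]} - {x}
        return parents | children | spouses

    # Example DAG: A -> C <- B, C -> D
    dag = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}
    print(markov_blanket(dag, "A"))   # {'C', 'B'}: child C and spouse B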


Bridging the Gaps in Statistical Models of Protein Alignment

arXiv.org Machine Learning

This work demonstrates how a complete statistical model quantifying the evolution of pairs of aligned proteins can be constructed from a time-parameterised substitution matrix and a time-parameterised 3-state alignment machine. All parameters of such a model can be inferred from any benchmark data set of aligned protein sequences. This allows us to examine nine well-known substitution matrices on six benchmarks curated using various structural alignment methods; any matrix that does not explicitly model a "time"-dependent Markov process is converted to a corresponding base matrix that does. In addition, a new optimal matrix is inferred for each of the six benchmarks. Using Minimum Message Length (MML) inference, all 15 matrices are compared by measuring the Shannon information content of each benchmark. This results in a clear overall best-performing time-dependent Markov matrix, MMLSUM, and its associated 3-state machine, whose properties we analyse in this work. For standard use, the MMLSUM series of (log-odds) scoring matrices derived from the above Markov matrix is available at https://lcb.infotech.monash.edu.au/mmlsum.
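For readers who only need scoring matrices, the standard recipe for deriving log-odds scores from a time-dependent Markov matrix is short. The sketch below uses an assumed 3-letter alphabet and a made-up unit-time matrix; the actual 20×20 MMLSUM series is the one published at the URL above.

    import numpy as np

    pi = np.array([0.5, 0.3, 0.2])            # assumed stationary frequencies
    M1 = np.array([[0.990, 0.006, 0.004],     # assumed unit-time Markov matrix
                   [0.007, 0.990, 0.003],     # (rows sum to 1)
                   [0.005, 0.005, 0.990]])

    def scoring_matrix(M1, pi, t, scale=2.0):
        """Log-odds scores at divergence time t: scale * log2(M^t[i,j] / pi[j])."""
        Mt = np.linalg.matrix_power(M1, t)
        return scale * np.log2(Mt / pi[np.newaxis, :])

    # Positive scores mark residue pairings more likely than chance at time t,
    # negative scores less likely.
    print(np.round(scoring_matrix(M1, pi, 250), 1))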


The Complexity of Morality: Checking Markov Blanket Consistency with DAGs via Morality

arXiv.org Machine Learning

A family of Markov blankets in a faithful Bayesian network satisfies the symmetry and consistency properties. In this paper, we establish a bijection between families of consistent Markov blankets and moral graphs. We define the new concepts of weak recursive simpliciality and perfect elimination kits, and prove that they are equivalent to graph morality. In addition, we prove that morality can be decided in polynomial time for graphs with maximum degree less than 5, but that the problem is NP-complete for graphs of higher maximum degree.
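The simpliciality notion at the heart of the paper builds on a classical one: a vertex is simplicial when its neighbourhood is a clique. The Python below checks only this classical property on a small undirected graph; the paper's weak recursive variant and perfect elimination kits are not reproduced here.

    from itertools import combinations

    def is_simplicial(graph, v):
        """True if v's neighbours form a clique; graph maps vertex -> neighbour set."""
        return all(u in graph[w] for u, w in combinations(graph[v], 2))

    # A 4-cycle 0-1-2-3 with chord 0-2: vertices 1 and 3 are simplicial,
    # vertices 0 and 2 are not (their neighbourhoods miss the edge 1-3).
    graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
    print([v for v in graph if is_simplicial(graph, v)])   # [1, 3]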


Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions

arXiv.org Machine Learning

Mixture modelling involves explaining some observed evidence using a combination of probability distributions. The crux of the problem is the inference of an optimal number of mixture components and their corresponding parameters. This paper discusses unsupervised learning of mixture models using the Bayesian Minimum Message Length (MML) criterion. To demonstrate the effectiveness of search and inference of mixture parameters using the proposed approach, we select two key probability distributions, each handling fundamentally different types of data: the multivariate Gaussian distribution, to address mixture modelling of data distributed in Euclidean space, and the multivariate von Mises-Fisher (vMF) distribution, to address mixture modelling of directional data distributed on a unit hypersphere. The key contributions of this paper, in addition to the general search and inference methodology, include the derivation of MML expressions for encoding the data using multivariate Gaussian and von Mises-Fisher distributions, and the analytical derivation of the MML estimates of the parameters of the two distributions. Our approach is tested on simulated and real-world data sets. For instance, we infer vMF mixtures that concisely explain experimentally determined three-dimensional protein conformations, providing an effective null-model description of protein structures that is central to many inference problems in structural bioinformatics. The experimental results demonstrate that the performance of our proposed search and inference method, together with the encoding schemes, improves on state-of-the-art mixture modelling techniques.
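At its most schematic, the search the abstract describes scores mixtures with increasing numbers of components and keeps the one with the shortest message. The Python sketch below substitutes scikit-learn's Gaussian mixtures and BIC as a crude stand-in for the paper's MML message length (both trade data fit against model complexity); the vMF case and the paper's actual search procedure are not reproduced here.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Synthetic data with two well-separated clusters in the plane.
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                      rng.normal(5.0, 1.0, size=(200, 2))])

    # Fit mixtures with 1..6 components and keep the best-scoring one.
    best = min(
        (GaussianMixture(n_components=k, random_state=0).fit(data)
         for k in range(1, 7)),
        key=lambda gm: gm.bic(data),
    )
    print(best.n_components)   # expected: 2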