The early detection of infectious disease outbreaks is crucial for protecting population health. To this end, public health surveillance systems have been established to systematically collect and analyse infectious disease data. A variety of statistical tools are available that use these data to detect potential outbreaks as aberrations from an expected endemic level. Here, we develop the first supervised learning approach based on hidden Markov models for disease outbreak detection, which leverages data that are routinely collected within a public health surveillance system. We evaluate our model using real Salmonella and Campylobacter data, as well as simulations. In comparison to a state-of-the-art approach, which is applied in multiple European countries including Germany, our proposed model reduces the false positive rate by up to 50% while retaining the same sensitivity. We see our supervised learning approach as a significant step towards further developing machine learning applications for disease outbreak detection, which will be instrumental in improving public health surveillance systems.
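The abstract above does not spell out the model, so the following is only a minimal sketch of what a supervised hidden Markov model for outbreak detection could look like, not the authors' method. It assumes two hidden states (endemic and outbreak), Poisson-distributed weekly case counts, and training weeks that carry an outbreak label. The function names, the add-one smoothing, the prior and the toy data are all hypothetical.

```python
import math

def fit_supervised_hmm(counts, labels):
    """Estimate a 2-state Poisson HMM from labelled counts.

    labels[t] is 0 (endemic) or 1 (outbreak). Assumes both states occur
    in the training data and that each state has a positive mean count.
    """
    rates = []
    for state in (0, 1):
        obs = [c for c, l in zip(counts, labels) if l == state]
        rates.append(sum(obs) / len(obs))  # Poisson rate = mean count in state
    # Count labelled transitions, with add-one smoothing (an assumption).
    trans = [[1.0, 1.0], [1.0, 1.0]]
    for a, b in zip(labels, labels[1:]):
        trans[a][b] += 1.0
    trans = [[x / sum(row) for x in row] for row in trans]
    return rates, trans

def poisson_pmf(k, lam):
    # Computed in log space for numerical stability.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def outbreak_probability(counts, rates, trans, prior=(0.9, 0.1)):
    """Forward filtering: P(state_t = outbreak | counts[0..t]) for each t."""
    belief = list(prior)
    result = []
    for t, c in enumerate(counts):
        if t > 0:
            # Predict step: push the belief through the transition matrix.
            belief = [sum(belief[i] * trans[i][j] for i in (0, 1)) for j in (0, 1)]
        # Update step: weight by the likelihood of the observed count.
        belief = [belief[s] * poisson_pmf(c, rates[s]) for s in (0, 1)]
        total = sum(belief)
        belief = [b / total for b in belief]
        result.append(belief[1])
    return result

# Toy labelled history (hypothetical numbers, not surveillance data).
train_counts = [3, 4, 2, 5, 3, 12, 15, 11, 4, 3, 2, 4]
train_labels = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
rates, trans = fit_supervised_hmm(train_counts, train_labels)
probs = outbreak_probability([3, 2, 14, 13, 3], rates, trans)
alarms = [p > 0.5 for p in probs]
```

Because the labels are observed during training, the parameters follow from simple counting rather than from an iterative procedure such as Baum-Welch; the 0.5 alarm threshold is an arbitrary choice that trades sensitivity against false positives.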
DNA sequence inhomogeneity is present both in training and control sets of coding and non-coding regions. Coding region inhomogeneity, caused by differences in sequence composition between "native" and horizontally transferred genes or between genes expressed at different levels, contributes to the false negative error rate. Inhomogeneity of non-coding regions may frequently be caused by the presence of unnoticed genes and contributes to the false positive error rate. We have documented such unnoticed genes in GenBank sequences for several species. Some of the protein products of these genes have been characterized by similarity search methods. For others, which we call "pioneer genes", no significant similarity has been found at the protein sequence level although the confidence of the GeneMark prediction is high. For instance, to date a majority of the pioneer gene predictions made for E. coli show strong similarity to more recently characterized proteins that have since been added to protein sequence databases. Another practical question is related to genomic sequence inhomogeneity at the interspecies level: if GeneMark has not been trained for a particular species, is it possible to apply models derived for phylogenetically close genomes?
A team of researchers has found a new way to detect dangerous strains of bacteria, potentially preventing outbreaks of food poisoning. The team developed a method that utilizes machine learning and tested it with isolates of Escherichia coli strains. The details are in a paper that was just published in the journal Proceedings of the National Academy of Sciences. Most strains of Escherichia coli are harmless and naturally found in the human body. There are pathogenic strains, however, and they are a rising health concern.
Correct inference of genetic regulation inside a cell is one of the greatest challenges of the post-genomic era for biologists and researchers. Several intelligent techniques and models have already been proposed to identify the regulatory relations among genes from biological data such as time series microarray data. The Recurrent Neural Network (RNN) is one of the most popular and simple approaches for modelling the dynamics and inferring correct dependencies among genes. In this paper, the Bat Algorithm (BA) is applied to optimize the parameters of the RNN model of a Gene Regulatory Network (GRN). Initially, the proposed method is tested on a small artificial network without any noise, and its efficiency is observed in terms of the number of iterations, the population size and the BA optimization parameters. The model is also validated on the small artificial network in the presence of different levels of random noise, which demonstrated its ability to infer the correct regulations under noise comparable to that of real-world datasets. In the next phase of this research, the BA-based RNN is applied to a real-world benchmark time series microarray dataset of E. coli. The results show that it is able to identify the maximum number of true positive regulations, but that it also includes some false positive regulations. Therefore, BA is well suited to identifying biologically plausible GRNs with the help of the RNN model.
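To make the idea of fitting an RNN model of a GRN with the Bat Algorithm concrete, here is a minimal sketch under stated assumptions; it is not the paper's implementation. It assumes the commonly used discrete-time form x_i(t+dt) = x_i(t) + (dt/tau_i)(sigmoid(sum_j w_ij x_j(t) + b_i) - x_i(t)), keeps every tau_i fixed at 1 for brevity, and uses a simplified variant of Yang's Bat Algorithm. All function names, parameter bounds, the 0.1 local-walk scale and the two-gene toy network are hypothetical.

```python
import math
import random

def simulate_rnn(weights, biases, taus, x0, steps, dt=0.1):
    """Simulate expression levels with the assumed discrete-time RNN model."""
    n = len(x0)
    series = [list(x0)]
    for _ in range(steps - 1):
        x = series[-1]
        nxt = []
        for i in range(n):
            z = sum(weights[i][j] * x[j] for j in range(n)) + biases[i]
            s = 1.0 / (1.0 + math.exp(-z))
            nxt.append(x[i] + dt / taus[i] * (s - x[i]))
        series.append(nxt)
    return series

def fitness(params, target, n):
    """Mean squared error between simulated and target time series.

    params holds n*n weights (row-major) followed by n biases.
    """
    weights = [params[i * n:(i + 1) * n] for i in range(n)]
    biases = params[n * n:n * n + n]
    pred = simulate_rnn(weights, biases, [1.0] * n, target[0], len(target))
    err = sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return err / (len(target) * n)

def bat_algorithm(objective, dim, n_bats=20, iters=200, lo=-5.0, hi=5.0,
                  f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9, r0=0.5, seed=0):
    """Simplified Bat Algorithm minimising `objective` over [lo, hi]^dim."""
    rng = random.Random(seed)
    clip = lambda v: max(lo, min(hi, v))
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    cost = [objective(p) for p in pos]
    loud = [1.0] * n_bats   # loudness A_i, decays when a bat improves
    rate = [r0] * n_bats    # pulse emission rate r_i
    best_i = min(range(n_bats), key=cost.__getitem__)
    best, best_cost = list(pos[best_i]), cost[best_i]
    for t in range(1, iters + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Global move: a random frequency scales the offset from the best bat.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - b) * freq for v, x, b in zip(vel[i], pos[i], best)]
            cand = [clip(x + v) for x, v in zip(pos[i], vel[i])]
            if rng.random() > rate[i]:
                # Local random walk around the best solution found so far.
                cand = [clip(b + 0.1 * mean_loud * rng.uniform(-1, 1)) for b in best]
            c = objective(cand)
            if c <= cost[i] and rng.random() < loud[i]:
                pos[i], cost[i] = cand, c
                loud[i] *= alpha
                rate[i] = r0 * (1 - math.exp(-gamma * t))
            if c < best_cost:
                best, best_cost = list(cand), c
    return best, best_cost

# Noise-free toy target from a known 2-gene network (hypothetical values).
true_w = [[2.0, -3.0], [3.0, 0.5]]
true_b = [0.0, -1.0]
target = simulate_rnn(true_w, true_b, [1.0, 1.0], [0.1, 0.9], steps=20)
best, best_cost = bat_algorithm(lambda p: fitness(p, target, 2),
                                dim=6, n_bats=15, iters=100)
```

In a sketch like this, regulations would be read off the signs and magnitudes of the fitted weights. A low fitting error does not guarantee that the true weights were recovered, since different parameter sets can produce similar trajectories, which is consistent with the false positive regulations mentioned above.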
Use of computational methods to predict gene regulatory networks (GRNs) from gene expression data is a challenging task. Many studies have been conducted using unsupervised methods to fulfill the task; however, such methods usually yield low prediction accuracies due to the lack of training data. In this article, we propose semi-supervised methods for GRN prediction by utilizing two machine learning algorithms, namely support vector machines (SVM) and random forests (RF). The semi-supervised methods make use of unlabeled data for training. We investigate inductive and transductive learning approaches, both of which adopt an iterative procedure to obtain reliable negative training data from the unlabeled data. We then apply our semi-supervised methods to gene expression data of Escherichia coli and Saccharomyces cerevisiae, and evaluate the performance of our methods using the expression data. Our analysis indicated that the transductive learning approach outperformed the inductive learning approach for both organisms. However, there was no conclusive difference identified in the performance of SVM and RF. Experimental results also showed that the proposed semi-supervised methods performed better than existing supervised methods for both organisms.
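The iterative selection of reliable negatives from unlabeled data can be illustrated with a small sketch. This is an assumption-laden illustration, not the authors' procedure: a nearest-centroid score stands in for the SVM or random forest decision value so that the example stays dependency-free, and the function names, the number of rounds and the toy feature vectors are hypothetical.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def score(x, pos_c, neg_c):
    """Higher means more positive-like (stand-in for an SVM/RF decision value)."""
    return sq_dist(x, neg_c) - sq_dist(x, pos_c)

def select_reliable_negatives(positives, unlabeled, n_neg, rounds=5, seed=0):
    """Iteratively refine a set of negatives drawn from the unlabeled pool."""
    rng = random.Random(seed)
    # Round 0: a random subset of the unlabeled pool acts as provisional negatives.
    neg_idx = rng.sample(range(len(unlabeled)), n_neg)
    pos_c = centroid(positives)
    for _ in range(rounds):
        # "Train" on positives versus the current negatives.
        neg_c = centroid([unlabeled[i] for i in neg_idx])
        # Rank the whole unlabeled pool; the least positive-like come first.
        ranked = sorted(range(len(unlabeled)),
                        key=lambda i: score(unlabeled[i], pos_c, neg_c))
        new_idx = ranked[:n_neg]
        if set(new_idx) == set(neg_idx):
            break  # the selection has stabilised
        neg_idx = new_idx
    return sorted(neg_idx)

# Toy data: known regulatory pairs and a pool that hides two more positives.
positives = [[2.0, 2.1], [1.9, 2.0], [2.2, 1.8]]
unlabeled = [[2.1, 2.0], [1.8, 2.2],            # hidden positives
             [-2.0, -1.9], [-2.1, -2.2], [-1.8, -2.0], [-2.2, -1.7]]
negatives = select_reliable_negatives(positives, unlabeled, n_neg=3)
```

In practice the centroid score would be replaced by the decision value or class probability of the chosen classifier, retrained on the positives and the current negatives in each round.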