Application of Markov Structure of Genomes to Outlier Identification and Read Classification

Karr, Alan F., Hauzel, Jason, Porter, Adam A., Schaefer, Marcel

Dec-24-2021–arXiv.org Machine Learning

The importance of the sequential structure of genomes has been recognized since the discovery of DNA. In this paper we take a statistical and stochastic-process perspective on triplets of successive bases to address two applications: identifying outliers in genome databases, and classifying reads in the metagenomic context of reference-guided assembly. From this perspective, triplets form a second-order Markov chain specified by the distribution of each base conditional on its two immediate predecessors. Studying genomes via base sequence distributions is, of course, not novel. Previous papers have addressed genome signatures (Karlin et al., 1997; Campbell et al., 1999; Takashi et al., 2003), as well as frequentist (Rosen et al., 2008) and Bayesian (Wang et al., 2007) approaches to classification problems.
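
To make the second-order Markov description concrete, the following minimal sketch estimates the distribution of each base conditional on its two immediate predecessors from a single sequence. It is an illustration under stated assumptions, not the authors' implementation: the function name `triplet_conditional_distribution`, the `pseudocount` parameter, the choice to skip triplets containing non-ACGT symbols, and the toy sequence are all hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def triplet_conditional_distribution(genome, pseudocount=0.0):
    """Estimate P(base | two preceding bases) from one sequence.

    Hypothetical helper for illustration only. Triplets that contain
    any symbol outside A, C, G, T (for example N) are skipped, which
    is an assumption of this sketch.
    """
    # Count every overlapping triplet of successive bases.
    counts = Counter()
    for i in range(len(genome) - 2):
        triplet = genome[i:i + 3]
        if all(b in BASES for b in triplet):
            counts[triplet] += 1

    # Normalize the counts within each two-base context.
    dist = {}
    for a, b in product(BASES, repeat=2):
        context = a + b
        total = sum(counts[context + c] for c in BASES) + 4 * pseudocount
        if total == 0:
            continue  # context never observed; omit it from the result
        dist[context] = {
            c: (counts[context + c] + pseudocount) / total for c in BASES
        }
    return dist


# Toy sequence (made up for this example, not real genome data).
toy = "ACGTACGTACGA"
dist = triplet_conditional_distribution(toy)
print(dist["CG"])  # conditional distribution of the base that follows "CG"
```

Each observed two-base context maps to a probability vector over the four bases, so a genome is summarized by at most 16 such vectors. A positive `pseudocount` would smooth contexts with few observations, which may matter for short reads.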

coronavirus genome, genome, probability


Country:
- North America > United States
  - New York (0.04)
  - Maryland > Prince George's County
    - College Park (0.04)
- Europe > Austria
  - Vienna (0.14)

Genre:
- Research Report (0.50)

Industry:
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area
    - Infections and Infectious Diseases (1.00)
    - Pulmonary/Respiratory Diseases (1.00)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Performance Analysis > Accuracy (0.94)
  - Statistical Learning > Clustering (0.68)
  - Learning Graphical Models
    - Directed Networks > Bayesian Learning (0.47)
    - Undirected Networks > Markov Models (0.34)