Application of Markov Structure of Genomes to Outlier Identification and Read Classification
Karr, Alan F., Hauzel, Jason, Porter, Adam A., Schaefer, Marcel
That the sequential structure of genomes is important has been known since the discovery of DNA. In this paper we take a statistical, stochastic-process perspective on triplets of successive bases to address two important applications: identifying outliers in genome databases, and classifying reads in the metagenomic context of reference-guided assembly. From this perspective, triplets form a second-order Markov chain specified by the distribution of each base conditional on its two immediate predecessors. To be sure, studying genomes via base sequence distributions is not novel. Previous papers have addressed genome signatures (Karlin et al., 1997; Campbell et al., 1999; Takashi et al., 2003), as well as frequentist (Rosen et al., 2008) and Bayesian (Wang et al., 2007) approaches to classification problems.
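To make the second-order Markov description concrete, the following is a minimal sketch of how the distribution of each base conditional on its two immediate predecessors might be estimated from triplet counts. It is an illustration under stated assumptions, not the authors' implementation: the function name `triplet_transition_probs` is hypothetical, and skipping triplets that contain symbols other than A, C, G, and T is an assumed convention.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def triplet_transition_probs(sequence):
    """Estimate P(base | two preceding bases) from overlapping triplets.

    Hypothetical helper for illustration; triplets containing any symbol
    other than A, C, G, T (e.g. N) are skipped, which is an assumption.
    """
    seq = sequence.upper()
    counts = Counter()
    # Count every overlapping triplet of successive bases.
    for i in range(len(seq) - 2):
        triplet = seq[i:i + 3]
        if all(b in BASES for b in triplet):
            counts[triplet] += 1

    probs = {}
    # For each two-base context, normalize the counts of the third base.
    for a, b in product(BASES, repeat=2):
        context = a + b
        total = sum(counts[context + c] for c in BASES)
        if total > 0:  # contexts never observed are left out
            probs[context] = {c: counts[context + c] / total for c in BASES}
    return probs


probs = triplet_transition_probs("ACGACGACT")
```

In this toy sequence the context `AC` is followed by `G` twice and by `T` once, so `probs["AC"]` assigns those bases estimated probabilities of 2/3 and 1/3. Contexts that never occur, such as `TT`, are absent from the result rather than given a default distribution.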
Dec-24-2021
- Country:
- Europe > Austria
- Vienna (0.14)
- North America > United States
- Maryland > Prince George's County
- College Park (0.04)
- New York (0.04)
- Genre:
- Research Report (0.50)
- Technology: