Measuring Quality of DNA Sequence Data via Degradation

Karr, Alan F., Hauzel, Jason, Porter, Adam A., Schaefer, Marcel

Dec-24-2021, arXiv.org Machine Learning

As public genome databases proliferate, their immense scientific power is tempered by skepticism about their quality. The skepticism is not merely anecdotal: there are documented instances of quality problems and of their implications (Commichaux et al., 2021; Langdon, 2014; Steinegger and Salzberg, 2020). Although we argue in Appendix A that data quality should not be construed as comprising only errors in data, the principal contribution of the paper is a novel paradigm for measuring the quality of genome sequences by deliberately introducing errors that reduce quality, a process we term degradation. The errors are single nucleotide polymorphisms (SNPs), insertions, and deletions, all of which both occur naturally as mutations and arise in next-generation sequencing. Our reasoning is that higher-quality data are more fragile: the higher the initial quality, the greater the effect of the same amount of degradation.
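To make the notion of degradation concrete, the following is a minimal sketch in Python of introducing a fixed number of random SNPs, insertions, and deletions into a nucleotide string. It is an illustration under stated assumptions, not the paper's implementation: the function name `degrade`, its parameters, the uniform choice among error types and positions, and the restriction to the alphabet `ACGT` are all hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"  # assumed alphabet; ambiguity codes such as N are not generated


def degrade(sequence, n_errors, rng=None, kinds=("snp", "insertion", "deletion")):
    """Return a copy of `sequence` with `n_errors` random edits applied.

    Hypothetical helper: each edit is drawn uniformly from `kinds` and
    applied at a uniformly chosen position.
    """
    rng = rng or random.Random()
    bases = list(sequence)
    for _ in range(n_errors):
        kind = rng.choice(kinds)
        if kind == "insertion" or not bases:
            # Insert a random nucleotide; also used as a fallback when the
            # sequence is empty and nothing can be substituted or deleted.
            pos = rng.randrange(len(bases) + 1)
            bases.insert(pos, rng.choice(NUCLEOTIDES))
        elif kind == "snp":
            # Substitute a different nucleotide at one position.
            pos = rng.randrange(len(bases))
            alternatives = [b for b in NUCLEOTIDES if b != bases[pos]]
            bases[pos] = rng.choice(alternatives)
        else:
            # Delete one nucleotide.
            pos = rng.randrange(len(bases))
            del bases[pos]
    return "".join(bases)


if __name__ == "__main__":
    original = "ACGT" * 10
    degraded = degrade(original, n_errors=5, rng=random.Random(0))
    print(original)
    print(degraded)
```

Passing a seeded `random.Random` makes a degradation run reproducible, which matters when the same amount of degradation is to be applied to data sets of differing initial quality. How the effect of degradation is then quantified is outside the scope of this sketch.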

