Measuring Quality of DNA Sequence Data via Degradation
Karr, Alan F., Hauzel, Jason, Porter, Adam A., Schaefer, Marcel
As public genome databases proliferate, their immense scientific power is tempered by skepticism about their quality. The skepticism is not merely anecdotal: quality problems and their implications have been documented (Commichaux et al., 2021; Langdon, 2014; Steinegger and Salzberg, 2020). Although we argue in Appendix A that data quality should not be construed as comprising only errors in data, the principal contribution of the paper is a novel paradigm for measuring quality of genome sequences by deliberately introducing errors that reduce quality, a process we term degradation. The errors are single nucleotide polymorphisms (SNPs), insertions, and deletions, all of which occur naturally as mutations and also arise in next-generation sequencing. Our reasoning is that higher quality data are more fragile: the higher the initial quality, the greater the effect of the same amount of degradation.
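To make the idea of degradation concrete, the following is a minimal sketch of how SNPs, insertions, and deletions could be introduced into a sequence. It is an illustration under stated assumptions, not the authors' implementation: the function name `degrade`, the per-base error probability `rate`, the uniform choice among the three error types, and the fixed `seed` are all hypothetical choices made for this example.

```python
import random

BASES = "ACGT"

def degrade(seq, rate, seed=0):
    """Return a degraded copy of seq.

    Assumption (not from the paper): each base is independently hit with
    probability `rate`, and a hit is a SNP, an insertion, or a deletion,
    chosen with equal probability.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    out = []
    for base in seq:
        if rng.random() >= rate:
            out.append(base)  # base left untouched
            continue
        kind = rng.choice(("snp", "insertion", "deletion"))
        if kind == "snp":
            # substitute a different nucleotide
            out.append(rng.choice([b for b in BASES if b != base]))
        elif kind == "insertion":
            # keep the base and insert a random nucleotide after it
            out.append(base)
            out.append(rng.choice(BASES))
        # deletion: append nothing, so the base is dropped
    return "".join(out)

original = "ACGT" * 10
degraded = degrade(original, rate=0.1, seed=42)
```

Applying `degrade` with the same `rate` to sequences of differing initial quality, and comparing the effect on some downstream measure, is the kind of experiment the fragility argument suggests; the choice of that measure is left open here.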
Dec-24-2021
- Country:
- Europe > Austria
- Vienna (0.14)
- North America > United States
- Maryland > Prince George's County
- College Park (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- New Jersey > Hudson County
- Hoboken (0.04)
- New York (0.05)
- Genre:
- Research Report (0.50)
- Industry: