ADRS-CNet: An adaptive models of dimensionality reduction methods for DNA storage clustering algorithms

Liu, Bowen, Li, Jiankun

arXiv.org Artificial Intelligence 

In the downstream information retrieval stage of DNA storage, specific hybridization techniques, such as Polymerase Chain Reaction (PCR) or magnetic bead separation, are commonly used to access data [1]. However, the technology faces several challenges, including high base error rates (such as insertions, deletions, and substitutions) and the loss of stored sequences, both of which threaten the reliability of the stored data [2]. Clustering and alignment of the sequencing data can address these issues. A commonly used feature extraction method is based on k-mer frequency matrices, whose feature dimensionality grows exponentially with k [3][4][5]: over the four-letter DNA alphabet there are 4^k possible k-mers. Selecting an appropriate dimensionality reduction technique therefore becomes a critical challenge. Among the many available algorithms, Principal Component Analysis (PCA) [6], t-distributed Stochastic Neighbor Embedding (t-SNE) [7], and Uniform Manifold Approximation and Projection (UMAP) [8] are particularly prominent in cell biology, bioinformatics, and data visualization [9]. This study aims to develop an adaptive classification model that identifies the optimal dimensionality reduction method, thereby mitigating the curse of dimensionality caused by k-mer feature extraction and enhancing the effectiveness of K-means clustering in restoring the original sequence information.
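To illustrate why k-mer features become high-dimensional so quickly, the following is a minimal sketch of k-mer frequency extraction, not the authors' implementation. The function names `kmer_index` and `kmer_frequencies` and the toy reads are hypothetical and chosen only for illustration. It assumes the four-letter alphabet ACGT and skips any window that contains another symbol.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: reads use only the four canonical bases


def kmer_index(k):
    """Map every possible k-mer over ACGT to a column index (4**k entries)."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_frequencies(seq, k, index=None):
    """Return the normalized k-mer frequency vector of a single read."""
    index = index or kmer_index(k)
    counts = [0] * len(index)
    total = 0
    # Slide a window of length k over the read and count each k-mer.
    for i in range(len(seq) - k + 1):
        col = index.get(seq[i:i + k])
        if col is not None:  # skip windows with symbols outside ACGT
            counts[col] += 1
            total += 1
    # Normalize counts to frequencies; an empty read yields all zeros.
    return [c / total for c in counts] if total else counts


# Toy reads for illustration only (hypothetical, not real sequencing data).
reads = ["ACGTACGTAC", "ACGTTCGTAC", "TTGGCCAATT"]
for k in (1, 2, 3, 4):
    matrix = [kmer_frequencies(r, k) for r in reads]
    print(k, len(matrix[0]))  # number of feature columns for this k
```

Each row of `matrix` is one read and each column is one k-mer, so the column count is 4^k. Under these assumptions, it is a matrix of this shape that a dimensionality reduction method such as PCA, t-SNE, or UMAP would compress before K-means clustering.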
