nanopore
NanoBaseLib: A Multi-Task Benchmark Dataset for Nanopore Sequencing Supplementary Material
Dataset documentation and intended uses. Recommended documentation frameworks include datasheets for datasets, dataset nutrition labels, data statements for NLP, and accountability frameworks. An author statement that they bear all responsibility in case of violation of rights, etc., and links to access the dataset and its metadata. Simulation environments should link to (open source) code repositories. The dataset itself should ideally use an open and widely used data format.
- Europe > Finland (0.05)
- Asia > South Korea > Busan > Busan (0.04)
- Asia > Middle East > Jordan (0.04)
NanoBaseLib: A Multi-Task Benchmark Dataset for Nanopore Sequencing
Nanopore sequencing is a third-generation sequencing technology capable of generating long-read sequences and directly measuring modifications on DNA/RNA molecules, which makes it well suited to biological applications such as human Telomere-to-Telomere (T2T) genome assembly, Ebola virus surveillance and COVID-19 mRNA vaccine development. However, the accuracy of computational methods across the various tasks of Nanopore sequencing data analysis is far from satisfactory. For instance, the base calling accuracy of Nanopore RNA sequencing is $\sim$90\%, while the aim is $\sim$99.9\%. This highlights an urgent need for contributions from the machine learning community. A bottleneck that prevents machine learning researchers from entering this field is the lack of a large integrated benchmark dataset.
Path Signatures Enable Model-Free Mapping of RNA Modifications
Lemercier, Maud, Arrubarrena, Paola, Di Giorgio, Salvatore, Brettschneider, Julia, Cass, Thomas, Vries, Isabel S. Naarmann-de, Papavasiliou, Anastasia, Ruggieri, Alessia, Tellioglu, Irem, Wu, Chia Ching, Papavasiliou, F. Nina, Lyons, Terry
Detecting chemical modifications on RNA molecules remains a key challenge in epitranscriptomics. Traditional reverse transcription-based sequencing methods introduce enzyme- and sequence-dependent biases and fragment RNA molecules, confounding the accurate mapping of modifications across the transcriptome. Nanopore direct RNA sequencing offers a powerful alternative by preserving native RNA molecules, enabling the detection of modifications at single-molecule resolution. However, current computational tools can identify only a limited subset of modification types within well-characterized sequence contexts for which ample training data exists. Here, we introduce a model-free computational method that reframes modification detection as an anomaly detection problem, requiring only canonical (unmodified) RNA reads without any other annotated data. For each nanopore read, our approach extracts robust, modification-sensitive features from the raw ionic current signal at a site using the signature transform, then computes an anomaly score by comparing the resulting feature vector to its nearest neighbors in an unmodified reference dataset. We convert anomaly scores into statistical p-values to enable anomaly detection at both individual read and site levels. Validation on densely-modified \textit{E. coli} rRNA demonstrates that our approach detects known sites harboring diverse modification types, without prior training on these modifications. We further applied this framework to dengue virus (DENV) transcripts and mammalian mRNAs. For DENV sfRNA, it revealed a novel 2'-O-methylated site, which we validated orthogonally by qRT-PCR assays. These results demonstrate that our model-free approach operates robustly across different types of RNAs and datasets generated with different nanopore sequencing chemistries.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (3 more...)
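The anomaly-detection pipeline described in the abstract above (signature features, nearest-neighbor score against unmodified reads, conversion to a p-value) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the signature is truncated at depth 2, the score is the mean Euclidean distance to the k nearest reference vectors, the p-value is a simple empirical rank against held-out canonical reads, and `simulate_read` is a hypothetical toy signal standing in for real ionic current data.

```python
import math
import random


def signature_depth2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    `path` is a sequence of d-dimensional points; the result is a flat
    list of d + d*d numbers.
    """
    d = len(path[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for prev, cur in zip(path, path[1:]):
        dx = [c - p for p, c in zip(prev, cur)]
        # Chen's identity for appending one straight segment to the path.
        for i in range(d):
            for j in range(d):
                level2[i][j] += level1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):
            level1[i] += dx[i]
    return level1 + [v for row in level2 for v in row]


def knn_score(vec, reference, k=5):
    """Mean Euclidean distance from `vec` to its k nearest reference vectors."""
    nearest = sorted(math.dist(vec, ref) for ref in reference)[:k]
    return sum(nearest) / len(nearest)


def empirical_p_value(score, calibration_scores):
    """Rank-based p-value: share of calibration scores at least as large."""
    exceed = sum(1 for c in calibration_scores if c >= score)
    return (1 + exceed) / (1 + len(calibration_scores))


def simulate_read(rng, step_height, n=20, noise=0.05):
    """Hypothetical toy signal: a noisy current step as (time, current) points."""
    path = []
    for i in range(n):
        level = 0.0 if i < n // 2 else step_height
        path.append((i / (n - 1), level + rng.gauss(0.0, noise)))
    return path


rng = random.Random(0)
# Reference set and calibration scores use canonical (unmodified) reads only.
reference = [signature_depth2(simulate_read(rng, 1.0)) for _ in range(200)]
calibration_scores = [
    knn_score(signature_depth2(simulate_read(rng, 1.0)), reference)
    for _ in range(99)
]

canonical = signature_depth2(simulate_read(rng, 1.0))
modified = signature_depth2(simulate_read(rng, 3.0))  # shifted current level
p_canonical = empirical_p_value(knn_score(canonical, reference), calibration_scores)
p_modified = empirical_p_value(knn_score(modified, reference), calibration_scores)
print(p_canonical, p_modified)
```

In this toy setting the shifted read lies far from every canonical feature vector, so it receives the smallest p-value the calibration set allows. Site-level calls would additionally require combining the per-read p-values, which this sketch does not attempt.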
Deep Learning-Driven Peptide Classification in Biological Nanopores
Tovey, Samuel, Hoßbach, Julian, Kuppel, Sandro, Ensslen, Tobias, Behrends, Jan C., Holm, Christian
A device capable of performing real-time classification of proteins in a clinical setting would allow for inexpensive and rapid disease diagnosis. One candidate for this technology is the nanopore device. These devices work by measuring a current signal that arises when a protein or peptide enters a nanometer-length-scale pore. Should this current be uniquely related to the structure of the peptide and its interactions with the pore, the signals can be used to perform identification. While such a method would allow for real-time identification of peptides and proteins in a clinical setting, to date, the complexity of these signals has limited classification accuracy. In this work, we tackle the issue of classification by converting the current signals into scalogram images via wavelet transforms, capturing amplitude, frequency, and time information in a modality well-suited to machine learning algorithms. When tested on 42 peptides, our method achieved a classification accuracy of ~$81\,\%$, setting a new state-of-the-art in the field and taking a step toward practical peptide/protein diagnostics at the point of care. In addition, we demonstrate model transfer techniques that will be critical when deploying these models into real hardware, paving the way to a new method for real-time disease diagnosis.
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
- Europe > Germany > Baden-Württemberg > Freiburg (0.04)
- North America > United States (0.04)
- (3 more...)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.67)
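As an illustration of the signal-to-image step mentioned in the peptide classification abstract above, the following is a minimal sketch of a scalogram computed with a continuous wavelet transform. The choice of a complex Morlet wavelet, the parameter `w0`, the scale grid, and the sine-wave test signal are all assumptions for this example; the abstract does not state which wavelet or settings the authors used.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet wavelet, without normalisation constant or correction term."""
    return cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def scalogram(signal, scales, w0=6.0):
    """Return |CWT| as a list of rows: one row per scale, one column per sample."""
    n = len(signal)
    image = []
    for s in scales:
        # Conjugated wavelet samples for every possible offset k - b.
        kernel = {off: morlet(off / s, w0).conjugate() for off in range(-(n - 1), n)}
        row = []
        for b in range(n):
            acc = sum(x * kernel[k - b] for k, x in enumerate(signal))
            row.append(abs(acc) / math.sqrt(s))
        image.append(row)
    return image


w0 = 6.0
periods = [4, 8, 16, 32]  # oscillation periods in samples, one image row each
# For a Morlet wavelet, scale s responds most strongly to period 2*pi*s / w0.
scales = [p * w0 / (2 * math.pi) for p in periods]

# Toy "current trace": a pure oscillation with a 16-sample period.
signal = [math.sin(2 * math.pi * k / 16) for k in range(128)]
image = scalogram(signal, scales, w0)

mid = len(signal) // 2
strongest = max(range(len(scales)), key=lambda i: image[i][mid])
print(periods[strongest])
```

Each row of `image` is the wavelet magnitude over time at one scale, so the 2-D array can be fed to an image classifier. The direct summation here is quadratic in signal length and is meant only to show the idea; an FFT-based transform would be the usual choice for real traces.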
Enhancing Downstream Analysis in Genome Sequencing: Species Classification While Basecalling
Kodra, Riselda, Benmeziane, Hadjer, Boybat, Irem, Simon, William Andrew
The ability to quickly and accurately identify microbial species in a sample, known as metagenomic profiling, is critical across various fields, from healthcare to environmental science. This paper introduces a novel method to profile signals coming from sequencing devices in parallel with determining their nucleotide sequences, a process known as basecalling, via a multi-objective deep neural network for simultaneous basecalling and multi-class genome classification. We introduce a new loss strategy where losses for basecalling and classification are back-propagated separately, with model weights combined for the shared layers, and a pre-configured ranking strategy allowing top-K species accuracy, giving users flexibility to choose between higher accuracy or higher speed at identifying the species. We achieve state-of-the-art basecalling accuracies, while classification accuracies meet and exceed the results of state-of-the-art binary classifiers, attaining an average of 92.5%/98.9% accuracy at identifying the top-1/3 species among a total of 17 genomes in the Wick bacterial dataset. The work presented here has implications for future studies in metagenomic profiling by accelerating the bottleneck step of matching the DNA sequence to the correct genome.
- North America > United States (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
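The top-1/3 figures quoted in the basecalling abstract above are top-K accuracies: a read counts as correct if its true genome is among the K highest-scoring classes. A minimal sketch of that metric follows; the function name and the toy scores are hypothetical, and the paper's ranking strategy may include steps not shown here.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical class scores for four reads over four candidate genomes.
scores = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.3, 0.0],
    [0.1, 0.2, 0.3, 0.4],
    [0.3, 0.4, 0.2, 0.1],
]
labels = [0, 2, 0, 1]

print(top_k_accuracy(scores, labels, k=1))  # 0.5
print(top_k_accuracy(scores, labels, k=3))  # 0.75
```

Raising K trades specificity for recall: the second read is misclassified at K=1 but recovered once the runner-up class is allowed.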
Label-free SERS Discrimination of Proline from Hydroxylated Proline at Single-molecule Level Assisted by a Deep Learning Model
Zhao, Yingqi, Zhan, Kuo, Xin, Pei-Lin, Chen, Zuyan, Li, Shuai, De Angelis, Francesco, Huang, Jianan
ABSTRACT: Discriminating low-abundance hydroxylated proline from proline is crucial for monitoring diseases and evaluating therapeutic outcomes, and it requires single-molecule sensors. While the plasmonic nanopore sensor can detect hydroxylation with single-molecule sensitivity by surface-enhanced Raman spectroscopy (SERS), it suffers from intrinsic fluctuations of single-molecule signals as well as strong interference from citrates. Here, we used the occurrence frequency histogram of the single-molecule SERS peaks to extract spectral features of the overall dataset, overcome the signal fluctuations, and investigate citrate-replaced plasmonic nanopore sensors for clean and distinguishable signals of proline and hydroxylated proline. Upon ligand exchange of the citrates with analyte molecules, the representative peaks of citrates decreased with incubation time, showing that the analytes occupied the plasmonic hot spot. As a result, a convolutional neural network model discriminated the single-molecule SERS signals of proline and hydroxylated proline with 96.6% accuracy.
- Health & Medicine > Pharmaceuticals & Biotechnology (0.72)
- Materials > Chemicals (0.68)
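One plausible reading of the "occurrence frequency histogram" in the SERS abstract above is the fraction of single-molecule spectra that show a peak in each wavenumber bin. The sketch below implements that reading only; the bin width, the wavenumber range, and the peak positions are hypothetical, and the authors' exact definition may differ.

```python
import math


def occurrence_histogram(peak_lists, lo, hi, bin_width):
    """Per-bin fraction of spectra that contain at least one peak in that bin.

    `peak_lists` holds one list of detected peak positions (cm^-1) per spectrum.
    """
    n_bins = math.ceil((hi - lo) / bin_width)
    counts = [0] * n_bins
    for peaks in peak_lists:
        # Count each spectrum at most once per bin.
        bins_hit = {int((p - lo) // bin_width) for p in peaks if lo <= p < hi}
        for b in bins_hit:
            counts[b] += 1
    return [c / len(peak_lists) for c in counts]


# Hypothetical peak positions detected in four single-molecule spectra.
peak_lists = [[850, 1002], [855], [1003, 1400], [851, 1001]]
hist = occurrence_histogram(peak_lists, lo=800, hi=1500, bin_width=100)
print(hist)  # [0.75, 0.0, 0.75, 0.0, 0.0, 0.0, 0.25]
```

Because each spectrum contributes at most one count per bin, a peak that flickers in and out between acquisitions still accumulates into a stable dataset-level feature, which is the motivation the abstract gives for using the histogram.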
Bacteria-powered artificial tongue can taste-test alcohol for additives
A tiny device home to genetically modified bacteria may soon function like an artificial tongue that rapidly analyzes an alcoholic drink's chemical composition. Using existing biological nanopore technology that underpins DNA sequencing, these new tools could even one day test whether or not a beverage is contaminated with unwanted additives, or even deadly toxins. Current nanopore technology relies on a modified bacterium, usually Mycobacterium smegmatis, to perform microscopic chemical assessments. To accomplish this, experts first create extremely tiny holes only a few nanometers wide in the bacteria's cell membrane. Researchers then mix the altered organisms into a liquid before applying a small electrical charge to the solution.