 rna-seq


Rare Genomic Subtype Discovery from RNA-seq via Autoencoder Embeddings and Stability-Aware Clustering

Mezghiche, Alaa

arXiv.org Artificial Intelligence

Unsupervised learning on high-dimensional RNA-seq data can reveal molecular subtypes beyond standard labels. We combine an autoencoder-based representation with clustering and stability analysis to search for rare but reproducible genomic subtypes. On the UCI "Gene Expression Cancer RNA-Seq" dataset (801 samples, 20,531 genes; BRCA, COAD, KIRC, LUAD, PRAD), a pan-cancer analysis shows clusters aligning almost perfectly with tissue of origin (Cramer's V = 0.887), serving as a negative control. We therefore reframe the problem within KIRC (n = 146): we select the top 2,000 highly variable genes, standardize them, train a feed-forward autoencoder (128-dimensional latent space), and run k-means for k = 2-10. While global indices favor small k, scanning k with a pre-specified discovery rule (rare < 10 percent and stable with Jaccard >= 0.60 across 20 seeds after Hungarian alignment) yields a simple solution at k = 5 (silhouette = 0.129, DBI = 2.045) with a rare cluster C0 (6.85 percent of patients) that is highly stable (Jaccard = 0.787). Cluster-vs-rest differential expression (Welch's t-test, Benjamini-Hochberg FDR) identifies coherent markers. Overall, pan-cancer clustering is dominated by tissue of origin, whereas a stability-aware within-cancer approach reveals a rare, reproducible KIRC subtype.
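The discovery rule in this abstract (a cluster is kept if it is rare, under 10 percent of samples, and stable, with Jaccard >= 0.60 across seeds after label alignment) can be illustrated with a minimal sketch. This is not the authors' code. The function names are hypothetical, stability is assumed to be measured against one fixed reference run, and a brute-force search over label permutations stands in for the Hungarian algorithm, which is only practical for small k.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard index of two sets of sample indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def members(labels, k):
    """Map each cluster id 0..k-1 to the set of sample indices assigned to it."""
    out = {c: set() for c in range(k)}
    for i, c in enumerate(labels):
        out[c].add(i)
    return out


def align(reference, other, k):
    """Relabel `other` so its cluster ids best match `reference`.

    Brute force over all k! permutations, maximizing total overlap.
    Assumption: this replaces the Hungarian algorithm for illustration only.
    """
    ref, oth = members(reference, k), members(other, k)
    best = max(
        permutations(range(k)),
        key=lambda p: sum(len(ref[c] & oth[p[c]]) for c in range(k)),
    )
    # best[c] is the label in `other` matched to reference cluster c
    inverse = {best[c]: c for c in range(k)}
    return [inverse[c] for c in other]


def rare_stable_clusters(reference, runs, k, max_frac=0.10, min_jaccard=0.60):
    """Return {cluster: (fraction, mean Jaccard)} for clusters passing the rule."""
    n = len(reference)
    ref = members(reference, k)
    aligned = [members(align(reference, r, k), k) for r in runs]
    hits = {}
    for c in range(k):
        frac = len(ref[c]) / n
        stab = sum(jaccard(ref[c], m[c]) for m in aligned) / len(aligned)
        if frac < max_frac and stab >= min_jaccard:
            hits[c] = (frac, stab)
    return hits


# Toy example: 30 samples, k = 3, cluster 0 holds 2 samples in the reference run.
reference = [0] * 2 + [1] * 14 + [2] * 14
run_a = [2] * 2 + [0] * 14 + [1] * 14   # same partition, permuted labels
run_b = [1] * 3 + [2] * 13 + [0] * 14   # rare cluster absorbs one extra sample
print(rare_stable_clusters(reference, [run_a, run_b], k=3))
```

In the toy data only cluster 0 passes both thresholds. For realistic k, `scipy.optimize.linear_sum_assignment` would be the usual choice for the alignment step.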


Assessing Concordance between RNA-Seq and NanoString Technologies in Ebola-Infected Nonhuman Primates Using Machine Learning

Rezapour, Mostafa, Narayanan, Aarthi, Mowery, Wyatt H., Gurcan, Metin Nafi

arXiv.org Artificial Intelligence

This study evaluates the concordance between RNA sequencing (RNA-Seq) and NanoString technologies for gene expression analysis in non-human primates (NHPs) infected with Ebola virus (EBOV). We performed a detailed comparison of both platforms, demonstrating a strong correlation between them, with Spearman coefficients for 56 out of 62 samples ranging from 0.78 to 0.88, with a mean of 0.83 and a median of 0.85. Bland-Altman analysis further confirmed high consistency, with most measurements falling within 95% confidence limits. A machine learning approach, using the Supervised Magnitude-Altitude Scoring (SMAS) method trained on NanoString data, identified OAS1 as a key marker for distinguishing RT-qPCR positive from negative samples. Remarkably, when applied to RNA-Seq data, OAS1 also achieved 100% accuracy in differentiating infected from uninfected samples using logistic regression, demonstrating its robustness across platforms. Further differential expression analysis identified 12 common genes including ISG15, OAS1, IFI44, IFI27, IFIT2, IFIT3, IFI44L, MX1, MX2, OAS2, RSAD2, and OASL which demonstrated the highest levels of statistical significance and biological relevance across both platforms. Gene Ontology (GO) analysis confirmed that these genes are directly involved in key immune and viral infection pathways, reinforcing their importance in EBOV infection. In addition, RNA-Seq uniquely identified genes such as CASP5, USP18, and DDX60, which play key roles in immune regulation and antiviral defense. This finding highlights the broader detection capabilities of RNA-Seq and underscores the complementary strengths of both platforms in providing a comprehensive and accurate assessment of gene expression changes during Ebola virus infection.
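The two agreement measures named in this abstract, Spearman correlation and Bland-Altman analysis, can be sketched for a single sample as follows. This is an illustrative stdlib-only version, not the study's pipeline. The function names and the expression values are made up, and the Bland-Altman limits are assumed to be the conventional 95% limits of agreement (mean difference plus or minus 1.96 standard deviations).

```python
from statistics import mean, stdev


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bland_altman(x, y):
    """Return (bias, lower, upper): mean difference and 95% limits of agreement."""
    diffs = [a - b for a, b in zip(x, y)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical log-expression values for five genes in one sample.
rnaseq = [2.1, 3.4, 5.0, 7.2, 8.8]
nanostring = [2.0, 4.9, 4.6, 7.5, 8.1]
print(spearman(rnaseq, nanostring))
print(bland_altman(rnaseq, nanostring))
```

Repeating the calculation per sample gives the kind of per-sample coefficient distribution the abstract summarizes with a mean and median.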


Multi-Omic and Quantum Machine Learning Integration for Lung Subtypes Classification

Saggi, Mandeep Kaur, Bhatia, Amandeep Singh, Isaiah, Mensah, Gowher, Humaira, Kais, Sabre

arXiv.org Artificial Intelligence

Quantum Machine Learning (QML) is a red-hot field that brings novel discoveries and exciting opportunities to resolve, speed up, or refine the analysis of a wide range of computational problems. In the realm of biomedical research and personalized medicine, the significance of multi-omics integration lies in its ability to provide a thorough and holistic comprehension of complex biological systems. This technology links fundamental research to clinical practice. The insights gained from integrated omics data can be translated into clinical tools for diagnosis, prognosis, and treatment planning. The fusion of quantum computing and machine learning holds promise for unraveling complex patterns within multi-omics datasets, providing unprecedented insights into the molecular landscape of lung cancer. Due to the heterogeneity, complexity, and high dimensionality of multi-omic cancer data, characterized by the vast number of features (such as gene expression, micro-RNA, and DNA methylation) relative to the limited number of lung cancer patient samples, our prime motivation for this paper is the integration of multi-omic data, unique feature selection, and diagnostic classification of lung subtypes: lung squamous cell carcinoma (LUSC-I) and lung adenocarcinoma (LUAD-II) using quantum machine learning. We developed a method for finding the best differentiating features between LUAD and LUSC datasets, which has the potential for biomarker discovery.


Path-GPTOmic: A Balanced Multi-modal Learning Framework for Survival Outcome Prediction

Wang, Hongxiao, Yang, Yang, Zhao, Zhuo, Gu, Pengfei, Sapkota, Nishchal, Chen, Danny Z.

arXiv.org Artificial Intelligence

For predicting cancer survival outcomes, standard approaches in clinical research are often based on two main modalities: pathology images for observing cell morphology features, and genomic data (e.g., bulk RNA-seq) for quantifying gene expression. However, existing pathology-genomic multi-modal algorithms face significant challenges: (1) valuable biological insights regarding genes and gene-gene interactions are frequently overlooked; (2) one modality often dominates the optimization process, causing inadequate training for the other modality. In this paper, we introduce a new multi-modal "Path-GPTOmic" framework for cancer survival outcome prediction. First, to extract valuable biological insights, we regulate the embedding space of a foundation model, scGPT, initially trained on single-cell RNA-seq data, making it adaptable for bulk RNA-seq data. Second, to address the imbalance-between-modalities problem, we propose a gradient modulation mechanism tailored to the Cox partial likelihood loss for survival prediction. The contributions of the modalities are dynamically monitored and adjusted during the training process, so that both modalities are sufficiently trained. Evaluated on two TCGA (The Cancer Genome Atlas) datasets, our model achieves substantially improved survival prediction accuracy.
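For readers unfamiliar with the Cox partial likelihood loss mentioned in this abstract, a minimal sketch of the standard negative log partial likelihood follows. It does not implement the paper's gradient modulation mechanism. The function name and toy inputs are hypothetical, and tied event times are handled in the simple Breslow style.

```python
import math


def cox_neg_log_partial_likelihood(risk, time, event):
    """Negative log Cox partial likelihood (Breslow-style handling of ties).

    risk  : predicted log-risk score per patient
    time  : observed follow-up time per patient
    event : 1 if the event was observed, 0 if the patient was censored
    """
    loss = 0.0
    for i in range(len(risk)):
        if not event[i]:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time[i].
        at_risk = [risk[j] for j in range(len(risk)) if time[j] >= time[i]]
        m = max(at_risk)  # log-sum-exp trick for numerical stability
        log_denom = m + math.log(sum(math.exp(r - m) for r in at_risk))
        loss -= risk[i] - log_denom
    return loss


# Three patients, all with observed events, and uninformative (equal) scores.
print(cox_neg_log_partial_likelihood([0.0, 0.0, 0.0], [1, 2, 3], [1, 1, 1]))
# Giving the earliest-failing patient a higher risk score lowers the loss.
print(cox_neg_log_partial_likelihood([2.0, 0.0, 0.0], [1, 2, 3], [1, 1, 1]))
```

The loss depends only on the ordering of event times and the relative risk scores, which is why it suits ranking patients by predicted survival.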


Computational Genomics

#artificialintelligence

COVID-19 related info: We might choose to do this training online depending on the status of the pandemic in September. The general aim of the course is to equip participants with practical and technical knowledge to analyze single cell RNA-seq data. With this aim in mind, we will go through unsupervised machine learning methods to analyze high-dimensional data sets, and move on to statistical methods developed to analyze bulk RNA-seq. Lastly, we will introduce analysis techniques used for single cell RNA-seq. There will be theoretical lectures followed by practical sessions where students directly apply what they have learned.


Where To Find Publicly Available Genomics Data For Deep Learning?

#artificialintelligence

GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. ENCODE investigators employ a variety of assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing a diverse range of RNA sources, comparative genomics, integrative bioinformatic methods, and human curation. ArrayExpress Archive of Functional Genomics Data stores data from high-throughput functional genomics experiments. The EGA provides a service for the permanent archiving and distribution of personally identifiable genetic and phenotypic data resulting from biomedical research projects.


mTim: Rapid and accurate transcript reconstruction from RNA-Seq data

Zeller, Georg, Goernitz, Nico, Kahles, Andre, Behr, Jonas, Mudrakarta, Pramod, Sonnenburg, Soeren, Raetsch, Gunnar

arXiv.org Machine Learning

High-throughput sequencing technology applied to cellular mRNA (RNA-Seq) has revolutionized transcriptome studies [19, 17, 35, among many others]. In contrast to microarray platforms, which it has replaced in many applications, RNA-Seq can not only be used to accurately quantify known transcripts, but also to reveal the precise structure of transcripts at single-nucleotide resolution. RNA-Seq based transcript reconstruction has therefore become a valuable tool for the completion of genome annotations [22, 11, for instance] and further enabled subsequent analyses of differentially expressed genes [2], transcript isoforms [6, 4] and exons [3], all of which generally rely on correctly inferred transcript inventories. De novo transcript reconstruction is thus a pivotal step in the analysis of RNA-Seq data. There are two conceptually different strategies to approach this problem: one can either assemble transcripts directly from RNA-Seq reads using methodology that originated from genome assembly approaches [13, 23, 25].