biological process
Nonlinear multi-study factor analysis
Moran, Gemma E., Krishnan, Anandi
High-dimensional data often exhibit variation that can be captured by lower dimensional factors. For high-dimensional data from multiple studies or environments, one goal is to understand which underlying factors are common to all studies, and which factors are study or environment-specific. As a particular example, we consider platelet gene expression data from patients in different disease groups. In this data, factors correspond to clusters of genes which are co-expressed; we may expect some clusters (or biological pathways) to be active for all diseases, while some clusters are only active for a specific disease. To learn these factors, we consider a nonlinear multi-study factor model, which allows for both shared and specific factors. To fit this model, we propose a multi-study sparse variational autoencoder. The underlying model is sparse in that each observed feature (i.e. each dimension of the data) depends on a small subset of the latent factors. In the genomics example, this means each gene is active in only a few biological processes. Further, the model implicitly induces a penalty on the number of latent factors, which helps separate the shared factors from the group-specific factors. We prove that the latent factors are identified, and demonstrate our method recovers meaningful factors in the platelet gene expression data.
Time-Varying Network Driver Estimation (TNDE) Quantifies Stage-Specific Regulatory Effects From Single-Cell Snapshots
Identifying key driver genes governing biological processes such as development and disease progression remains a challenge. While existing methods can reconstruct cellular trajectories or infer static gene regulatory networks (GRNs), they often fail to quantify time-resolved regulatory effects within specific temporal windows. Here, we present Time-varying Network Driver Estimation (TNDE), a computational framework quantifying dynamic gene driver effects from single-cell snapshot data under a linear Markov assumption. TNDE leverages a shared graph attention encoder to preserve the local topological structure of the data. Furthermore, by incorporating partial optimal transport, TNDE accounts for unmatched cells arising from proliferation or apoptosis, thereby enabling trajectory alignment in non-equilibrium processes. Benchmarking on simulated datasets demonstrates that TNDE outperforms existing baseline methods across diverse complex regulatory scenarios. Applied to mouse erythropoiesis data, TNDE identifies stage-specific driver genes, the functional relevance of which is corroborated by biological validation. TNDE offers an effective quantitative tool for dissecting dynamic regulatory mechanisms underlying complex biological processes.
Interpretable Causal Representation Learning for Biological Data in the Pathway Space
de la Fuente, Jesus, Lehmann, Robert, Ruiz-Arenas, Carlos, Voges, Jan, Marin-Goñi, Irene, Martinez-de-Morentin, Xabier, Gomez-Cabrero, David, Ochoa, Idoia, Tegner, Jesper, Lagani, Vincenzo, Hernaez, Mikel
Predicting the impact of genomic and drug perturbations in cellular function is crucial for understanding gene functions and drug effects, ultimately leading to improved therapies. To this end, Causal Representation Learning (CRL) constitutes one of the most promising approaches, as it aims to identify the latent factors that causally govern biological systems, thus facilitating the prediction of the effect of unseen perturbations. Yet, current CRL methods fail in reconciling their principled latent representations with known biological processes, leading to models that are not interpretable. To address this major issue, we present SENA-discrepancy-VAE, a model based on the recently proposed CRL method discrepancy-VAE, that produces representations where each latent factor can be interpreted as the (linear) combination of the activity of a (learned) set of biological processes. To this extent, we present an encoder, SENA-δ, that efficiently compute and map biological processes' activity levels to the latent causal factors. We show that SENA-discrepancy-VAE achieves predictive performances on unseen combinations of interventions that are comparable with its original, non-interpretable counterpart, while inferring causal latent factors that are biologically meaningful.
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.67)
- Research Report > Promising Solution (0.48)
A scalable gene network model of regulatory dynamics in single cells
Bertin, Paul, Viviano, Joseph D., Tejada-Lapuerta, Alejandro, Wang, Weixu, Bauer, Stefan, Theis, Fabian J., Bengio, Yoshua
Single-cell data provide high-dimensional measurements of the transcriptional states of cells, but extracting insights into the regulatory functions of genes, particularly identifying transcriptional mechanisms affected by biological perturbations, remains a challenge. Many perturbations induce compensatory cellular responses, making it difficult to distinguish direct from indirect effects on gene regulation. Modeling how gene regulatory functions shape the temporal dynamics of these responses is key to improving our understanding of biological perturbations. Dynamical models based on differential equations offer a principled way to capture transcriptional dynamics, but their application to single-cell data has been hindered by computational constraints, stochasticity, sparsity, and noise. Existing methods either rely on low-dimensional representations or make strong simplifying assumptions, limiting their ability to model transcriptional dynamics at scale. We introduce a Functional and Learnable model of Cell dynamicS, FLeCS, that incorporates gene network structure into coupled differential equations to model gene regulatory functions. Given (pseudo)time-series single-cell data, FLeCS accurately infers cell dynamics at scale, provides improved functional insights into transcriptional mechanisms perturbed by gene knockouts, both in myeloid differentiation and K562 Perturb-seq experiments, and simulates single-cell trajectories of A549 cells following small-molecule perturbations.
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
- North America > United States (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
A Bioinformatic Approach Validated Utilizing Machine Learning Algorithms to Identify Relevant Biomarkers and Crucial Pathways in Gallbladder Cancer
Khatun, Rabea, Tasnim, Wahia, Akter, Maksuda, Islam, Md Manowarul, Uddin, Md. Ashraf, Mahmud, Md. Zulfiker, Das, Saurav Chandra
Gallbladder cancer (GBC) is the most frequent cause of disease among biliary tract neoplasms. Identifying the molecular mechanisms and biomarkers linked to GBC progression has been a significant challenge in scientific research. Few recent studies have explored the roles of biomarkers in GBC. Our study aimed to identify biomarkers in GBC using machine learning (ML) and bioinformatics techniques. We compared GBC tumor samples with normal samples to identify differentially expressed genes (DEGs) from two microarray datasets (GSE100363, GSE139682) obtained from the NCBI GEO database. A total of 146 DEGs were found, with 39 up-regulated and 107 down-regulated genes. Functional enrichment analysis of these DEGs was performed using Gene Ontology (GO) terms and REACTOME pathways through DAVID. The protein-protein interaction network was constructed using the STRING database. To identify hub genes, we applied three ranking algorithms: Degree, MNC, and Closeness Centrality. The intersection of hub genes from these algorithms yielded 11 hub genes. Simultaneously, two feature selection methods (Pearson correlation and recursive feature elimination) were used to identify significant gene subsets. We then developed ML models using SVM and RF on the GSE100363 dataset, with validation on GSE139682, to determine the gene subset that best distinguishes GBC samples. The hub genes outperformed the other gene subsets. Finally, NTRK2, COL14A1, SCN4B, ATP1A2, SLC17A7, SLIT3, COL7A1, CLDN4, CLEC3B, ADCYAP1R1, and MFAP4 were identified as crucial genes, with SLIT3, COL7A1, and CLDN4 being strongly linked to GBC development and prediction.
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.94)
- Health & Medicine > Therapeutic Area > Gastroenterology (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Oncology > Gallbladder Cancer (0.63)
QuST-LLM: Integrating Large Language Models for Comprehensive Spatial Transcriptomics Analysis
In this paper, we introduce QuST-LLM, an innovative extension of QuPath that utilizes the capabilities of large language models (LLMs) to analyze and interpret spatial transcriptomics (ST) data. In addition to simplifying the intricate and high-dimensional nature of ST data by offering a comprehensive workflow that includes data loading, region selection, gene expression analysis, and functional annotation, QuST-LLM employs LLMs to transform complex ST data into understandable and detailed biological narratives based on gene ontology annotations, thereby significantly improving the interpretability of ST data. Consequently, users can interact with their own ST data using natural language. Hence, QuST-LLM provides researchers with a potent functionality to unravel the spatial and functional complexities of tissues, fostering novel insights and advancements in biomedical research. QuST-LLM is a part of QuST project. The source code is hosted on GitHub and documentation is available at (https://github.com/huangch/qust).
- Oceania > Fiji (0.05)
- North America > United States > California > San Diego County > La Jolla (0.04)
An interpretable generative multimodal neuroimaging-genomics framework for decoding Alzheimer's disease
Dolci, Giorgio, Cruciani, Federica, Rahaman, Md Abdur, Abrol, Anees, Chen, Jiayu, Fu, Zening, Galazzo, Ilaria Boscolo, Menegaz, Gloria, Calhoun, Vince D.
Alzheimer's disease (AD) is the most prevalent form of dementia with a progressive decline in cognitive abilities. The AD continuum encompasses a prodormal stage known as Mild Cognitive Impairment (MCI), where patients may either progress to AD or remain stable. In this study, we leveraged structural and functional MRI to investigate the disease-induced grey matter and functional network connectivity changes. Moreover, considering AD's strong genetic component, we introduce SNPs as a third channel. Given such diverse inputs, missing one or more modalities is a typical concern of multimodal methods. We hence propose a novel deep learning-based classification framework where generative module employing Cycle GANs was adopted to impute missing data within the latent space. Additionally, we adopted an Explainable AI method, Integrated Gradients, to extract input features relevance, enhancing our understanding of the learned representations. Two critical tasks were addressed: AD detection and MCI conversion prediction. Experimental results showed that our model was able to reach the SOA in the classification of CN/AD reaching an average test accuracy of $0.926\pm0.02$. For the MCI task, we achieved an average prediction accuracy of $0.711\pm0.01$ using the pre-trained model for CN/AD. The interpretability analysis revealed significant grey matter modulations in cortical and subcortical brain areas well known for their association with AD. Moreover, impairments in sensory-motor and visual resting state network connectivity along the disease continuum, as well as mutations in SNPs defining biological processes linked to amyloid-beta and cholesterol formation clearance and regulation, were identified as contributors to the achieved performance. Overall, our integrative deep learning approach shows promise for AD detection and MCI prediction, while shading light on important biological insights.
- North America > United States > California (0.28)
- Europe > Italy (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Open Problem: Active Representation Learning
Milosevic, Nikola, Müller, Gesine, Huisken, Jan, Scherf, Nico
In this work, we introduce the concept of Active Representation Learning, a novel class of problems that intertwines exploration and representation learning within partially observable environments. We extend ideas from Active Simultaneous Localization and Mapping (active SLAM), and translate them to scientific discovery problems, exemplified by adaptive microscopy. We explore the need for a framework that derives exploration skills from representations that are in some sense actionable, aiming to enhance the efficiency and effectiveness of data collection and model building in the natural sciences.
- Europe > Germany > Lower Saxony > Gottingen (0.15)
- Europe > Germany > Saxony > Leipzig (0.05)
- North America > United States > Oklahoma > Beaver County (0.04)
- Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
- Health & Medicine > Therapeutic Area (0.47)
- Health & Medicine > Diagnostic Medicine (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.50)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Thought Graph: Generating Thought Process for Biological Reasoning
Hsu, Chi-Yang, Cox, Kyle, Xu, Jiawei, Tan, Zhen, Zhai, Tianhua, Hu, Mengzhou, Pratt, Dexter, Chen, Tianlong, Hu, Ziniu, Ding, Ying
We present the Thought Graph as a novel framework to support complex reasoning and use gene set analysis as an example to uncover semantic relationships between biological processes. Our framework stands out for its ability to provide a deeper understanding of gene sets, significantly surpassing GSEA by 40.28% and LLM baselines by 5.38% based on cosine similarity to human annotations. Our analysis further provides insights into future directions of biological processes naming, and implications for bioinformatics and precision medicine.
- North America > United States > Texas > Travis County > Austin (0.16)
- Asia > Singapore > Central Region > Singapore (0.05)
- North America > United States > California > San Diego County > San Diego (0.05)
- (6 more...)
Iteratively Improving Biomedical Entity Linking and Event Extraction via Hard Expectation-Maximization
Li, Xiaochu, Liu, Minqian, Xu, Zhiyang, Huang, Lifu
Biomedical entity linking and event extraction are two crucial tasks to support text understanding and retrieval in the biomedical domain. These two tasks intrinsically benefit each other: entity linking disambiguates the biomedical concepts by referring to external knowledge bases and the domain knowledge further provides additional clues to understand and extract the biological processes, while event extraction identifies a key trigger and entities involved to describe each biological process which also captures the structural context to better disambiguate the biomedical entities. However, previous research typically solves these two tasks separately or in a pipeline, leading to error propagation. What's more, it's even more challenging to solve these two tasks together as there is no existing dataset that contains annotations for both tasks. To solve these challenges, we propose joint biomedical entity linking and event extraction by regarding the event structures and entity references in knowledge bases as latent variables and updating the two task-specific models in a hard Expectation-Maximization (EM) fashion: (1) predicting the missing variables for each partially annotated dataset based on the current two task-specific models, and (2) updating the parameters of each model on the corresponding pseudo completed dataset. Experimental results on two benchmark datasets: Genia 2011 for event extraction and BC4GO for entity linking, show that our joint framework significantly improves the model for each individual task and outperforms the strong baselines for both tasks. We will make the code and model checkpoints publicly available once the paper is accepted.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- Asia > India (0.04)
- (5 more...)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)