FDA
GramSeq-DTA: A grammar-based drug-target affinity prediction approach fusing gene expression information
Debnath, Kusal, Rana, Pratip, Ghosh, Preetam
Drug-target affinity (DTA) prediction is a critical aspect of drug discovery. The meaningful representation of drugs and targets is crucial for accurate prediction. Using 1D string-based representations for drugs and targets is a common approach that has demonstrated good results in drug-target affinity prediction. However, these approach lacks information on the relative position of the atoms and bonds. To address this limitation, graph-based representations have been used to some extent. However, solely considering the structural aspect of drugs and targets may be insufficient for accurate DTA prediction. Integrating the functional aspect of these drugs at the genetic level can enhance the prediction capability of the models. To fill this gap, we propose GramSeq-DTA, which integrates chemical perturbation information with the structural information of drugs and targets. We applied a Grammar Variational Autoencoder (GVAE) for drug feature extraction and utilized two different approaches for protein feature extraction: Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). The chemical perturbation data is obtained from the L1000 project, which provides information on the upregulation and downregulation of genes caused by selected drugs. This chemical perturbation information is processed, and a compact dataset is prepared, serving as the functional feature set of the drugs. By integrating the drug, gene, and target features in the model, our approach outperforms the current state-of-the-art DTA prediction models when validated on widely used DTA datasets (BindingDB, Davis, and KIBA). This work provides a novel and practical approach to DTA prediction by merging the structural and functional aspects of biological entities, and it encourages further research in multi-modal DTA prediction.
Multi-view biomedical foundation models for molecule-target and property prediction
Suryanarayanan, Parthasarathy, Qiu, Yunguang, Sethi, Shreyans, Mahajan, Diwakar, Li, Hongyang, Yang, Yuxin, Eyigoz, Elif, Saenz, Aldo Guzman, Platt, Daniel E., Rumbell, Timothy H., Ng, Kenney, Dey, Sanjoy, Burch, Myson, Kwon, Bum Chul, Meyer, Pablo, Cheng, Feixiong, Hu, Jianying, Morrone, Joseph A.
Drug discovery is a complex, multi-stage process. Lead identification and lead optimization remain costly with low success-rates and computational methods play an important role in accelerating these tasks [1-3]. The prediction of a broad range of chemical and biological properties of candidate molecules is an essential component of screening and assessing molecules and data-driven, machine learning approaches have long aided in this process [4-6]. Molecular representations form the basis of machine learning models [2, 7], facilitating algorithmic and scientific advances in the field. However, learning useful and generalized latent representation is a hard problem due to limited amounts of labeled data, wide ranges of downstream tasks, vast chemical space, and large heterogeneity in molecular structures. Learning latent representations using unsupervised techniques is vital for such models to scale. Large language models (LLMs) have revolutionized other fields [8] and similar sequence-based foundation models have shown promise to learn molecular representations and be trainable on many downstream property prediction tasks [9-11]. A key advantage is that the transformer based architecture can learn in a self-supervised fashion to create a "pre-trained" molecular representation. The most direct application of LLM like transformers is facilitated by a sequence, text-based representation (e.g.
Automated neuroradiological support systems for multiple cerebrovascular disease markers -- A systematic review and meta-analysis
Phitidis, Jesse, O'Neil, Alison Q., Whiteley, William N., Alex, Beatrice, Wardlaw, Joanna M., Bernabeu, Miguel O., Hernández, Maria Valdés
Cerebrovascular diseases (CVD) can lead to stroke and dementia. Stroke is the second leading cause of death world wide and dementia incidence is increasing by the year. There are several markers of CVD that are visible on brain imaging, including: white matter hyperintensities (WMH), acute and chronic ischaemic stroke lesions (ISL), lacunes, enlarged perivascular spaces (PVS), acute and chronic haemorrhagic lesions, and cerebral microbleeds (CMB). Brain atrophy also occurs in CVD. These markers are important for patient management and intervention, since they indicate elevated risk of future stroke and dementia. We systematically reviewed automated systems designed to support radiologists reporting on these CVD imaging findings. We considered commercially available software and research publications which identify at least two CVD markers. In total, we included 29 commercial products and 13 research publications. Two distinct types of commercial support system were available: those which identify acute stroke lesions (haemorrhagic and ischaemic) from computed tomography (CT) scans, mainly for the purpose of patient triage; and those which measure WMH and atrophy regionally and longitudinally. In research, WMH and ISL were the markers most frequently analysed together, from magnetic resonance imaging (MRI) scans; lacunes and PVS were each targeted only twice and CMB only once. For stroke, commercially available systems largely support the emergency setting, whilst research systems consider also follow-up and routine scans. The systems to quantify WMH and atrophy are focused on neurodegenerative disease support, where these CVD markers are also of significance. There are currently no openly validated systems, commercially, or in research, performing a comprehensive joint analysis of all CVD markers (WMH, ISL, lacunes, PVS, haemorrhagic lesions, CMB, and atrophy).
New Alzheimer's research reveals 'quiet' phase of the disease, before symptoms appear
Fox News contributor Dr. Marc Siegel joins'Fox News Live' to discuss the FDA approving a new Alzheimer treatment drug and the FDA banning bromide vegetable oils. New details have emerged about how Alzheimer's disease affects the brain. Researchers led by the Allen Institute for Brain Science in Seattle and University of Washington Medicine have identified cellular changes in the brains of people with the disease -- and a timeline of when they occur. "Instead of looking at AD just through the usual lens of plaques and tangles, we focused on how specific cell types were changed in each phase," study author Dr. Kyle Travaglini, Ph.D., a scientist at Allen Institute, told Fox News Digital via email. ALZHEIMER'S DISEASE COULD BE SLOWED BY BOOSTING A CERTAIN PROTEIN IN THE BRAIN, RESEARCHERS SAY "We identified two main phases in AD by arranging donors along a continuous disease trajectory -- a slow, early phase with low levels of pathology and no cognitive decline, followed by a later phase where there's a huge buildup of pathology and cognitive decline."
FARM: Functional Group-Aware Representations for Small Molecules
Nguyen, Thao, Huang, Kuan-Hao, Liu, Ge, Burke, Martin D., Diao, Ying, Ji, Heng
We introduce Functional Group-Aware Representations for Small Molecules (FARM), a novel foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. The key innovation of FARM lies in its functional group-aware tokenization, which directly incorporates functional group information into the representations. This strategic reduction in tokenization granularity is intentionally aligned with key drivers of functional properties (i.e., functional groups), enhancing the model's understanding of chemical language. By expanding the chemical lexicon, FARM more effectively bridges SMILES and natural language, ultimately advancing the model's capacity to predict molecular properties. FARM also represents molecules from two perspectives: by using masked language modeling to capture atom-level features and by employing graph neural networks to encode the whole molecule topology. We rigorously evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 10 out of 12 tasks. These results highlight FARM's potential to improve molecular representation learning, with promising applications in drug discovery and pharmaceutical research. Artificial intelligence (AI) has emerged as a transformative tool in accelerating scientific discovery, particularly in drug development. However, one of the central challenges in this field is the scarcity of large labeled datasets required for traditional supervised learning methods. This has shifted the focus towards self-supervised pre-trained models that can extract meaningful patterns from vast amounts of unlabeled molecular data (Shen & Nicolaou, 2019). As a result, the development of robust foundation models for molecular representations is now more critical than ever. Despite significant advancements in other domains, such as natural language processing (NLP) and computer vision, there is still no dominant foundation model tailored to molecular representation in drug discovery (Zhang et al., 2023b). This paper begins to address this pressing gap by introducing an innovative approach that leverages functional group (FG)-aware tokenization in the context of both sequence-based and graph-based molecular representations.
Generative Artificial Intelligence for Navigating Synthesizable Chemical Space
Gao, Wenhao, Luo, Shitong, Coley, Connor W.
We introduce SynFormer, a generative modeling framework designed to efficiently explore and navigate synthesizable chemical space. Unlike traditional molecular generation approaches, we generate synthetic pathways for molecules to ensure that designs are synthetically tractable. By incorporating a scalable transformer architecture and a diffusion module for building block selection, SynFormer surpasses existing models in synthesizable molecular design. We demonstrate SynFormer's effectiveness in two key applications: (1) local chemical space exploration, where the model generates synthesizable analogs of a reference molecule, and (2) global chemical space exploration, where the model aims to identify optimal molecules according to a black-box property prediction oracle. Additionally, we demonstrate the scalability of our approach via the improvement in performance as more computational resources become available. With our code and trained models openly available, we hope that SynFormer will find use across applications in drug discovery and materials science.
BeanCounter: A low-toxicity, large-scale, and open dataset of business-oriented text
Many of the recent breakthroughs in language modeling have resulted from scaling effectively the same model architecture to larger datasets. In this vein, recent work has highlighted performance gains from increasing training dataset size and quality, suggesting a need for novel sources of large-scale datasets. In this work, we introduce BeanCounter, a public dataset consisting of more than 159B tokens extracted from businesses' disclosures. We show that this data is indeed novel: less than 0.1% of BeanCounter appears in Common Crawl-based datasets and it is an order of magnitude larger than datasets relying on similar sources. Given the data's provenance, we hypothesize that BeanCounter is comparatively more factual and less toxic than web-based datasets. Exploring this hypothesis, we find that many demographic identities occur with similar prevalence in BeanCounter but with significantly less toxic context relative to other datasets. To demonstrate the utility of BeanCounter, we evaluate and compare two LLMs continually pre-trained on BeanCounter with their base models. We find an 18-33% reduction in toxic generation and improved performance within the finance domain for the continually pretrained models. Collectively, our work suggests that BeanCounter is a novel source of low-toxicity and high-quality domain-specific data with sufficient scale to train multi-billion parameter LLMs.
Towards Integrating Epistemic Uncertainty Estimation into the Radiotherapy Workflow
Teichmann, Marvin Tom, Datar, Manasi, Kratzke, Lisa, Vega, Fernando, Ghesu, Florin C.
The precision of contouring target structures and organs-at-risk (OAR) in radiotherapy planning is crucial for ensuring treatment efficacy and patient safety. Recent advancements in deep learning (DL) have significantly improved OAR contouring performance, yet the reliability of these models, especially in the presence of out-of-distribution (OOD) scenarios, remains a concern in clinical settings. This application study explores the integration of epistemic uncertainty estimation within the OAR contouring workflow to enable OOD detection in clinically relevant scenarios, using specifically compiled data. Furthermore, we introduce an advanced statistical method for OOD detection to enhance the methodological framework of uncertainty estimation. Our empirical evaluation demonstrates that epistemic uncertainty estimation is effective in identifying instances where model predictions are unreliable and may require an expert review. Notably, our approach achieves an AUC-ROC of 0.95 for OOD detection, with a specificity of 0.95 and a sensitivity of 0.92 for implant cases, underscoring its efficacy. This study addresses significant gaps in the current research landscape, such as the lack of ground truth for uncertainty estimation and limited empirical evaluations. Additionally, it provides a clinically relevant application of epistemic uncertainty estimation in an FDA-approved and widely used clinical solution for OAR segmentation from Varian, a Siemens Healthineers company, highlighting its practical benefits.
GATher: Graph Attention Based Predictions of Gene-Disease Links
Narganes-Carlon, David, Myatt, Anniek, Mudaliar, Mani, Crowther, Daniel J.
Target selection is crucial in pharmaceutical drug discovery, directly influencing clinical trial success. Despite its importance, drug development remains resource-intensive, often taking over a decade with significant financial costs. High failure rates highlight the need for better early-stage target selection. We present GATher, a graph attention network designed to predict therapeutic gene-disease links by integrating data from diverse biomedical sources into a graph with over 4.4 million edges. GATher incorporates GATv3, a novel graph attention convolution layer, and GATv3HeteroConv, which aggregates transformations for each edge type, enhancing its ability to manage complex interactions within this extensive dataset. Utilizing hard negative sampling and multi-task pre-training, GATher addresses topological imbalances and improves specificity. Trained on data up to 2018 and evaluated through 2024, our results show GATher predicts clinical trial outcomes with a ROC AUC of 0.69 for unmet efficacy failures and 0.79 for positive efficacy. Feature attribution methods, using Captum, highlight key nodes and relationships, enhancing model interpretability. By 2024, GATher improved precision in prioritizing the top 200 clinical trial targets to 14.1%, an absolute increase of over 3.5% compared to other methods. GATher outperforms existing models like GAT, GATv2, and HGT in predicting clinical trial outcomes, demonstrating its potential in enhancing target validation and predicting clinical efficacy and safety.
Democratising Artificial Intelligence for Pandemic Preparedness and Global Governance in Latin American and Caribbean Countries
de Carvalho, Andre, Bonidia, Robson, Kong, Jude Dzevela, Dauhajre, Mariana, Struchiner, Claudio, Goedert, Guilherme, Stadler, Peter F., Walter, Maria Emilia, Sanches, Danilo, Day, Troy, Castro, Marcia, Edmunds, John, Colome-Hidalgo, Manuel, Morban, Demian Arturo Herrera, Franco, Edian F., Ugarte-Gil, Cesar, Espinoza-Lopez, Patricia, Carrasco-Escobar, Gabriel, Rocha, Ulisses
Infectious diseases, transmitted directly or indirectly, are among the leading causes of epidemics and pandemics. Consequently, several open challenges exist in predicting epidemic outbreaks, detecting variants, tracing contacts, discovering new drugs, and fighting misinformation. Artificial Intelligence (AI) can provide tools to deal with these scenarios, demonstrating promising results in the fight against the COVID-19 pandemic. AI is becoming increasingly integrated into various aspects of society. However, ensuring that AI benefits are distributed equitably and that they are used responsibly is crucial. Multiple countries are creating regulations to address these concerns, but the borderless nature of AI requires global cooperation to define regulatory and guideline consensus. Considering this, The Global South AI for Pandemic & Epidemic Preparedness & Response Network (AI4PEP) has developed an initiative comprising 16 projects across 16 countries in the Global South, seeking to strengthen equitable and responsive public health systems that leverage Southern-led responsible AI solutions to improve prevention, preparedness, and response to emerging and re-emerging infectious disease outbreaks. This opinion introduces our branches in Latin American and Caribbean (LAC) countries and discusses AI governance in LAC in the light of biotechnology. Our network in LAC has high potential to help fight infectious diseases, particularly in low- and middle-income countries, generating opportunities for the widespread use of AI techniques to improve the health and well-being of their communities.