 Pharmaceuticals & Biotechnology


Absorb & Escape: Overcoming Single Model Limitations in Generating Heterogeneous Genomic Sequences

Neural Information Processing Systems

Recent advances in immunology and synthetic biology have accelerated the development of deep generative methods for DNA sequence design. Two dominant approaches in this field are AutoRegressive (AR) models and Diffusion Models (DMs). However, genomic sequences are functionally heterogeneous, consisting of multiple connected regions (e.g., Promoter Regions, Exons, and Introns) where elements within each region come from the same probability distribution, but the overall sequence is non-homogeneous. This heterogeneous nature presents challenges for a single model to accurately generate genomic sequences. In this paper, we analyze the properties of AR models and DMs in heterogeneous genomic sequence generation, pointing out crucial limitations in both methods: (i) AR models capture the underlying distribution of data by factorizing and learning the transition probability but fail to capture the global property of DNA sequences.


A benchmark for prediction of transcriptomic responses to chemical perturbations across cell types

Neural Information Processing Systems

Single-cell transcriptomics has revolutionized our understanding of cellular heterogeneity and drug perturbation effects. To overcome these limitations, several groups have proposed using machine learning methods to directly predict the effect of chemical perturbations either across cell contexts or chemical space. However, advances in this field have been hindered by a lack of well-designed evaluation datasets and benchmarks. To drive innovation in perturbation modeling, the Open Problems Perturbation Prediction (OP3) benchmark introduces a framework for predicting the effects of small molecule perturbations on cell type-specific gene expression. OP3 leverages the Open Problems in Single-cell Analysis benchmarking infrastructure and is enabled by a new single-cell perturbation dataset, encompassing 146 compounds tested on human blood cells. The benchmark includes diverse data representations, evaluation metrics, and winning methods from our "Single-cell perturbation prediction: generalizing experimental interventions to unseen contexts" competition at NeurIPS 2023.


On the Scalability of GNNs for Molecular Graphs

Neural Information Processing Systems

Scaling deep learning models has been at the heart of recent revolutions in language modelling and image generation. Practitioners have observed a strong relationship between model size, dataset size, and performance. However, structure-based architectures such as Graph Neural Networks (GNNs) are yet to show the benefits of scale mainly due to lower efficiency of sparse operations, large data requirements, and lack of clarity about the effectiveness of various architectures. We address this drawback of GNNs by studying their scaling behavior. Specifically, we analyze message-passing networks, graph Transformers, and hybrid architectures on the largest public collection of 2D molecular graphs for supervised pretraining.


UniTox: Leveraging LLMs to Curate a Unified Dataset of Drug-Induced Toxicity from FDA Labels

Neural Information Processing Systems

Drug-induced toxicity is one of the leading reasons new drugs fail clinical trials. Machine learning models that predict drug toxicity from molecular structure could help researchers prioritize less toxic drug candidates. However, current toxicity datasets are typically small and limited to a single organ system (e.g., cardio, renal, or liver). Creating these datasets often involved time-intensive expert curation by parsing drug labelling documents that can exceed 100 pages per drug. Here, we introduce UniTox, a unified dataset of 2,418 FDA-approved drugs with drug-induced toxicity summaries and ratings created by using GPT-4o to process FDA drug labels.


Association Pattern-aware Fusion for Biological Entity Relationship Prediction

Neural Information Processing Systems

Deep learning-based methods significantly advance the exploration of associations among triple-wise biological entities (e.g., drug-target protein-adverse reaction), thereby facilitating drug discovery and safeguarding human health. However, existing research focuses only on entity-centric information mapping and aggregation, neglecting the crucial role of potential association patterns among different entities. To address this limitation, we propose a novel association pattern-aware fusion method for biological entity relationship prediction, which effectively integrates the related association pattern information into entity representation learning. Additionally, to compensate for the information missing from low-order message passing, we devise a bind-relation module that considers the strong bind of low-order entity associations. Extensive experiments conducted on three biological datasets quantitatively demonstrate that the proposed method achieves about 4%-23% hit@1 improvements compared with state-of-the-art baselines. Furthermore, the interpretability of association patterns is elucidated in detail, revealing the intrinsic biological mechanisms and supporting the method's deployment in real-world scenarios.


AdaNovo: Towards Robust De Novo Peptide Sequencing in Proteomics against Data Biases

Neural Information Processing Systems

Tandem mass spectrometry has played a pivotal role in advancing proteomics, enabling the high-throughput analysis of protein composition in biological tissues. Despite the development of several deep learning methods for predicting amino acid sequences (peptides) responsible for generating the observed mass spectra, training data biases hinder further advancements of de novo peptide sequencing. Firstly, prior methods struggle to identify amino acids with Post-Translational Modifications (PTMs) due to their lower frequency in training data compared to canonical amino acids, further resulting in unsatisfactory peptide sequencing performance. Secondly, various noise and missing peaks in mass spectra reduce the reliability of training data (Peptide-Spectrum Matches, PSMs). To address these challenges, we propose AdaNovo, a novel and domain knowledge-inspired framework that calculates Conditional Mutual Information (CMI) between the mass spectra and amino acids or peptides, using CMI for robust training against the above biases.


Where has the left's technological audacity gone? | Leigh Phillips

The Guardian

Techno-optimism – the belief that technology will usher in a golden age for humanity – is in vogue once more. In 2022, a clutch of pseudonymous San Francisco artificial intelligence (AI) scenesters published a Substack post entitled "Effective Accelerationism", which argued for maximum acceleration of technological advancement. The 10-point manifesto, which proclaimed that "the next evolution of consciousness, creating unthinkable next-generation lifeforms and silicon-based awareness" was imminent, quickly went viral, as did follow-up posts. Effective accelerationism, or "e/acc", exploded from being a fringe movement dedicated to pushing back against AI extinction-fearing "doomers" to being namechecked by major Silicon Valley CEOs such as Garry Tan, the CEO of start-up accelerator Y Combinator; Sam Altman, head of OpenAI; Marc Andreessen, the billionaire software engineer; and Elon Musk. In 2023, Andreessen issued his Techno-Optimist Manifesto, expanding beyond the e/acc's focus on AI to encompass all questions of technological progress.


The left needs to abandon its miserable, irrational pessimism | Aaron Bastani

The Guardian

At the start of the millennium it was widely presumed each successive generation would achieve a higher level of prosperity than the last. Today that is no longer the case. Just 19% of Americans expect their children's lives to be better than their own, while two-thirds believe their country will be economically weaker by 2050. So our zeitgeist is increasingly one of pessimism, from anxiety about the climate crisis to concern over rising inequality. According to the historian Adam Tooze, we are living through a "polycrisis" – where such challenges are not only simultaneous but mutually reinforcing.


Facebook scammers want you to think Elon Musk can cure diabetes

Engadget

Elon Musk discovered a simple 30-second "fridge trick" that can reverse diabetes, but the discovery has spooked pharmaceutical companies so much they put a $78 million bounty on his head, forcing the Tesla CEO to flee the country. At least, that's what a collection of AI-generated Facebook ads claim. Facebook ads depicting deepfakes of Elon Musk and Fox News personalities claiming that the Tesla CEO has discovered the cure for diabetes have been circulating on the platform for weeks. The ads seem to be part of a wider scam that uses the deepfakes to sell unproven supplements. Engadget has identified scores of pages running versions of these ads since early February.


Scientists Have Bred Woolly Mice on Their Journey to Bring Back the Mammoth

TIME - Tech

Recreating the species from that raw biological material is relatively straightforward in principle, if exceedingly painstaking in practice. The work involves pinpointing the genes responsible for the traits that separate the mammoth from the Asian elephant--its close evolutionary relation--editing an elephant stem cell to express those traits, and introducing the stem cell into an elephant embryo. In the alternative, scientists could edit a newly conceived Asian elephant zygote directly. Either way, the next step would be to implant the resulting embryo into the womb of a modern-day female elephant. After 22 months--the typical elephant gestation period--an ice age mammoth should, at least theoretically, be born into the computer-age world.