enzyme


Nobel prizewinner Omar Yaghi says his invention will change the world

New Scientist

Chemist Omar Yaghi invented materials called MOFs, a few grams of which have the surface area of a football field. In school, we learn about the Stone Age, the Bronze Age - and we are currently in a silicon age characterised by computers and phones. What might define the next age? Omar Yaghi at the University of California, Berkeley, thinks a family of materials he helped pioneer in the 1990s has a good shot. They are metal-organic frameworks (MOFs), and working out how to make them earned him a share of the 2025 Nobel prize in chemistry.


Flu Is Relentless. Crispr Might Be Able to Shut It Down

WIRED

Innovative research into the gene-editing tool targets influenza's ability to replicate--stopping it in its tracks. As he addressed an audience of virologists from China, Australia, and Singapore at October's Pandemic Research Alliance Symposium, Wei Zhao introduced an eye-catching idea. The gene-editing technology Crispr is best known for delivering groundbreaking new therapies for rare diseases, tweaking or knocking out rogue genes in conditions ranging from sickle cell disease to hemophilia. But Zhao and his colleagues at Melbourne's Peter Doherty Institute for Infection and Immunity have envisioned a new application. They believe Crispr could be tailored to create a next-generation treatment for influenza, whether that's the seasonal strains which plague both the northern and southern hemispheres on an annual basis, or the worrisome new variants in birds and other wildlife that might trigger the next pandemic.


ReactZyme: A Benchmark for Enzyme-Reaction Prediction

Neural Information Processing Systems

Enzymes, with their specific catalyzed reactions, are necessary for all aspects of life, enabling diverse biological processes and adaptations. Predicting enzyme functions is essential for understanding biological pathways, guiding drug development, enhancing bioproduct yields, and facilitating evolutionary studies. Addressing the inherent complexities, we introduce a new approach to annotating enzymes based on their catalyzed reactions. This method provides detailed insights into specific reactions and is adaptable to newly discovered reactions, diverging from traditional classifications by protein family or expert-derived reaction classes. We employ machine learning algorithms to analyze enzyme reaction datasets, delivering a much more refined view of the functionality of enzymes. Our evaluation leverages the largest enzyme-reaction dataset to date, derived from the SwissProt and Rhea databases with entries up to January 8, 2024. We frame enzyme-reaction prediction as a retrieval problem, aiming to rank enzymes by their catalytic ability for specific reactions. With our model, we can recruit proteins for novel reactions and predict reactions in novel proteins, facilitating enzyme discovery and function annotation: https://github.com/WillHua127/ReactZyme.
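Framing enzyme-reaction prediction as retrieval means that, at inference time, candidate enzymes are ranked by how similar their learned embeddings are to the embedding of a query reaction. The sketch below illustrates only that ranking step with toy vectors; the function names and embeddings are hypothetical stand-ins, not ReactZyme's actual pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_enzymes(reaction_emb, enzyme_embs):
    """Return enzyme indices sorted by descending similarity to the reaction."""
    scores = [(cosine(reaction_emb, e), i) for i, e in enumerate(enzyme_embs)]
    return [i for _, i in sorted(scores, reverse=True)]

# Toy embeddings: enzyme 2 is most aligned with the reaction.
reaction = [1.0, 0.0, 1.0]
enzymes = [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.1, 0.9]]
print(rank_enzymes(reaction, enzymes))  # [2, 1, 0]
```

A ranking like this is typically scored with top-k accuracy or mean reciprocal rank of the true catalyzing enzyme.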


Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients

Neural Information Processing Systems

Applying differentiable programming techniques and machine learning algorithms to foreign programs requires developers to either rewrite their code in a machine learning framework, or otherwise provide derivatives of the foreign code. This paper presents Enzyme, a high-performance automatic differentiation (AD) compiler plugin for the LLVM compiler framework capable of synthesizing gradients of statically analyzable programs expressed in the LLVM intermediate representation (IR). Enzyme synthesizes gradients for programs written in any language whose compiler targets LLVM IR including C, C++, Fortran, Julia, Rust, Swift, MLIR, etc., thereby providing native AD capabilities in these languages. Unlike traditional source-to-source and operator-overloading tools, Enzyme performs AD on optimized IR. On a machine-learning focused benchmark suite including Microsoft's ADBench, AD on optimized IR achieves a geometric mean speedup of 4.2 times over AD on IR before optimization, allowing Enzyme to achieve state-of-the-art performance. Packaging Enzyme for PyTorch and TensorFlow provides convenient access to gradients of foreign code with state-of-the-art performance, enabling foreign code to be directly incorporated into existing machine learning workflows.
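Enzyme operates on LLVM IR inside the compiler, but the technique it automates, reverse-mode AD, can be illustrated with a minimal Python sketch. This toy Var class is an illustration of the underlying idea only, not Enzyme's API or implementation.

```python
class Var:
    """Minimal reverse-mode AD value: records parents and local gradients."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (Var, local_gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        """Accumulate gradients by walking back through recorded parents."""
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

# d/dx (x*x + x*y) at x=3, y=2 is 2x + y = 8; d/dy is x = 3.
x, y = Var(3.0), Var(2.0)
f = x * x + x * y
f.backward()
print(x.grad, y.grad)  # 8.0 3.0
```

This naive recursion suffices for expression trees; a production tool like Enzyme differentiates a full dataflow graph, and, as the paper's benchmarks show, doing so after compiler optimization is substantially faster.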


CARE: a Benchmark Suite for the Classification and Retrieval of Enzymes

Neural Information Processing Systems

Enzymes are important proteins that catalyze chemical reactions. In recent years, machine learning methods have emerged to predict enzyme function from sequence; however, there are no standardized benchmarks to evaluate these methods. We introduce CARE, a benchmark and dataset suite for the Classification And Retrieval of Enzymes. CARE centers on two tasks: (1) classification of a protein sequence by its enzyme commission (EC) number and (2) retrieval of an EC number given a chemical reaction. For each task, we design train-test splits to evaluate different kinds of out-of-distribution generalization that are relevant to real use cases. For the classification task, we provide baselines for state-of-the-art methods. Because the retrieval task has not been previously formalized, we propose a method called Contrastive Reaction-EnzymE Pretraining (CREEP) as one of the first baselines for this task and compare it to the recent method, CLIPZyme. CARE is available at https://github.com/jsunn-y/CARE/.
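EC numbers are hierarchical (class, subclass, sub-subclass, serial number), so a prediction can be right at coarse levels while wrong at fine ones. As a hypothetical illustration of scoring against that hierarchy (not CARE's official metric), one can count how many leading EC levels a prediction matches:

```python
def ec_level_match(pred, true):
    """Depth of agreement between two EC numbers:
    '2.7.1.1' vs '2.7.2.4' agree on the first two levels -> 2."""
    depth = 0
    for a, b in zip(pred.split("."), true.split(".")):
        if a != b:
            break
        depth += 1
    return depth

print(ec_level_match("2.7.1.1", "2.7.2.4"))  # 2
print(ec_level_match("1.1.1.1", "1.1.1.1"))  # 4
```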


Self Distillation Fine-Tuning of Protein Language Models Improves Versatility in Protein Design

Tavakoli, Amin, Murugan, Raswanth, Gokdemir, Ozan, Ramanathan, Arvind, Arnold, Frances, Anandkumar, Anima

arXiv.org Artificial Intelligence

Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, yet its application to protein sequence modeling and protein language models (PLMs) remains ad hoc. This is in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences. Unlike existing approaches that require costly precompiled experimental datasets for SFT, our method leverages the PLM itself, integrating a lightweight curation pipeline with domain-specific filters to construct high-quality training data. These filters can independently refine a PLM's output and identify candidates for in vitro evaluation; when combined with SFT, they enable PLMs to generate more stable and functional enzymes, while expanding exploration into protein sequence space beyond natural variants. Although our approach is agnostic to both the choice of PLM and the protein system, we demonstrate its effectiveness with a genome-scale PLM (GenSLM) applied to the tryptophan synthase enzyme family. The supervised fine-tuned model generates sequences that are not only more novel but also display improved characteristics across both targeted design constraints and emergent protein property measures.
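The self-distillation curation loop described above (sample from the PLM, apply domain-specific filters, keep the survivors as SFT training data) can be sketched generically. Every name below (curate_sft_data, the toy generator, the two filters) is a hypothetical stand-in, not the authors' code:

```python
def curate_sft_data(generate, filters, n_candidates, n_keep):
    """Self-curation loop: sample sequences from the PLM and keep only
    those passing every domain-specific filter, up to n_keep examples."""
    kept = []
    for _ in range(n_candidates):
        seq = generate()
        if all(f(seq) for f in filters):
            kept.append(seq)
        if len(kept) == n_keep:
            break
    return kept

# Toy stand-ins: a deterministic "PLM" and two sequence-level filters.
samples = iter(["MKT", "XXA", "MVL", "MAA"])
plm_sample = lambda: next(samples)
starts_with_met = lambda s: s.startswith("M")  # e.g. require a start methionine
no_ambiguous = lambda s: "X" not in s          # e.g. reject ambiguous residues
sft_data = curate_sft_data(plm_sample, [starts_with_met, no_ambiguous], 4, 2)
print(sft_data)  # ['MKT', 'MVL']
```

The appeal of this shape is that the filters are reusable on their own, both for cleaning model output and for nominating candidates for in vitro testing, exactly as the abstract notes.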


Fused Gromov-Wasserstein Contrastive Learning for Effective Enzyme-Reaction Screening

Zhou, Gengmo, Yu, Feng, Wang, Wenda, Gao, Zhifeng, Ke, Guolin, Wei, Zhewei, Wang, Zhen

arXiv.org Artificial Intelligence

Enzymes are crucial catalysts that enable a wide range of biochemical reactions. Efficiently identifying specific enzymes from vast protein libraries is essential for advancing biocatalysis. Traditional computational methods for enzyme screening and retrieval are time-consuming and resource-intensive. Recently, deep learning approaches have shown promise. However, these methods focus solely on the interaction between enzymes and reactions, overlooking the inherent hierarchical relationships within each domain. To address these limitations, we introduce FGW-CLIP, a novel contrastive learning framework based on optimizing the fused Gromov-Wasserstein distance. FGW-CLIP incorporates multiple alignments, including inter-domain alignment between reactions and enzymes and intra-domain alignment within enzymes and reactions. By introducing a tailored regularization term, our method minimizes the Gromov-Wasserstein distance between enzyme and reaction spaces, which enhances information integration across these domains. Extensive evaluations demonstrate the superiority of FGW-CLIP in challenging enzyme-reaction tasks. On the widely-used EnzymeMap benchmark, FGW-CLIP achieves state-of-the-art performance in enzyme virtual screening, as measured by BEDROC and EF metrics. Moreover, FGW-CLIP consistently outperforms prior methods across all three splits of ReactZyme, the largest enzyme-reaction benchmark, demonstrating robust generalization to novel enzymes and reactions. These results position FGW-CLIP as a promising framework for enzyme discovery in complex biochemical settings, with strong adaptability across diverse screening scenarios.
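The Gromov-Wasserstein ingredient compares pairwise-distance structure within each space rather than points across spaces directly. As a simplified, hypothetical illustration (using a hard index matching instead of the soft optimal-transport coupling a method like FGW-CLIP optimizes), one can measure how well matched pairs preserve intra-domain distances:

```python
def gw_discrepancy(D1, D2, matching):
    """Gromov-Wasserstein-style cost under a hard matching: penalize
    index pairs (i, j) whose within-space distances disagree across
    the two domains. matching[i] is the index in space 2 paired with i."""
    n = len(matching)
    cost = 0.0
    for i in range(n):
        for j in range(n):
            cost += (D1[i][j] - D2[matching[i]][matching[j]]) ** 2
    return cost / (n * n)

# Toy intra-domain distance matrices (say, enzyme space vs. reaction space).
D_enz = [[0.0, 1.0], [1.0, 0.0]]
D_rxn = [[0.0, 2.0], [2.0, 0.0]]
print(gw_discrepancy(D_enz, D_enz, [0, 1]))  # 0.0: identical structure
print(gw_discrepancy(D_enz, D_rxn, [0, 1]))  # 0.5: distances disagree
```

Minimizing a term of this shape pushes the enzyme and reaction embedding spaces toward sharing geometry, which is the intra-domain alignment the abstract describes.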


EnzyCLIP: A Cross-Attention Dual Encoder Framework with Contrastive Learning for Predicting Enzyme Kinetic Constants

Khan, Anas Aziz, Fahad, Md Shah, Priyanka, null, Chandra, Ramesh, Singh, Guransh

arXiv.org Artificial Intelligence

Accurate prediction of enzyme kinetic parameters is crucial for drug discovery, metabolic engineering, and synthetic biology applications. Current computational approaches face limitations in capturing complex enzyme-substrate interactions and often focus on single parameters while neglecting the joint prediction of catalytic turnover numbers (Kcat) and Michaelis-Menten constants (Km). We present EnzyCLIP, a novel dual-encoder framework that leverages contrastive learning and cross-attention mechanisms to predict enzyme kinetic parameters from protein sequences and substrate molecular structures. Our approach integrates ESM-2 protein language model embeddings with ChemBERTa chemical representations through a CLIP-inspired architecture enhanced with bidirectional cross-attention for dynamic enzyme-substrate interaction modeling. EnzyCLIP combines InfoNCE contrastive loss with Huber regression loss to learn aligned multimodal representations while predicting log10-transformed kinetic parameters. The model is trained on the CatPred-DB database, which contains 23,151 experimentally validated Kcat measurements and 41,174 Km measurements, and achieved competitive performance with R2 scores of 0.593 for Kcat and 0.607 for Km prediction. XGBoost ensemble methods applied to the learned embeddings further improved Km prediction (R2 = 0.61) while maintaining robust Kcat performance.
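The combined objective (an InfoNCE contrastive term for enzyme-substrate alignment plus a Huber regression term on log10-transformed kinetic constants) can be sketched in a few lines. The function names and the alpha weighting below are illustrative assumptions, not EnzyCLIP's exact formulation:

```python
import math

def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = abs(pred - target)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def info_nce(sim_row, pos_index, temperature=0.07):
    """InfoNCE loss for one enzyme given its similarities to every
    substrate in the batch; sim_row[pos_index] is the true pair."""
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # log-sum-exp stabilization
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[pos_index]

def combined_loss(sim_row, pos_index, kcat_pred, kcat_true, alpha=0.5):
    """Weighted sum of contrastive alignment and Huber regression on
    log10 Kcat (Km would be handled analogously)."""
    reg = huber(math.log10(kcat_pred), math.log10(kcat_true))
    return alpha * info_nce(sim_row, pos_index) + (1 - alpha) * reg
```

Working in log10 space keeps the regression target well-scaled, since measured Kcat and Km values span many orders of magnitude.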


Supplementary Material

Neural Information Processing Systems

Dataset documentation and intended uses:

1. Introduction

The current methodologies for enzyme annotation primarily rely on established databases and classifications such as KEGG Orthology (KO), Enzyme Commission (EC) numbers, and Gene Ontology (GO) annotations, each with its specific focus and methodology. For instance, the EC system categorizes enzymes based on the chemical reactions they catalyze, providing a hierarchical numerical classification. KO links gene products to their functional orthologs across different species, whereas GO offers a broader ontology for describing the roles of genes and proteins in any organism. Despite their widespread use, these systems have notable limitations. The EC classification, while widely used, sometimes groups vastly different enzymes under the same category or subdivides similar ones excessively, based on the substrates they interact with, leading to ambiguities in enzyme function characterization.