AlphaFold2



OpenProteinSet: Training data for structural biology at scale

Neural Information Processing Systems

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
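OpenProteinSet distributes its MSAs in FASTA/a3m-style text files. As a minimal illustration of what consuming such a file involves, the sketch below parses FASTA-formatted MSA text into (header, sequence) pairs; the example sequences are invented, and real OpenProteinSet files are far larger and may use a3m-specific conventions (lowercase insertion columns) not handled here.

```python
# Minimal sketch: parse FASTA-formatted MSA text into (header, sequence) pairs.
# Example sequences are illustrative, not taken from OpenProteinSet.
def parse_msa(text):
    """Return a list of (header, sequence) tuples from FASTA-style MSA text."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:           # flush the previous record
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)              # sequences may span several lines
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries

msa = parse_msa(">query\nMKTAYIAK\n>homolog_1\nMKT-YIAR\n")
```

Gap characters (`-`) are kept verbatim, since downstream models attend over aligned columns.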


Virtual Collaboration

Communications of the ACM

The holy grail for scientists is to be able to focus on the research that produces scientific discoveries while offloading time-consuming tasks. So-called artificial intelligence (AI) co-scientists are helping to make this possible. These collaborative AI systems are designed to assist human researchers by accelerating scientific discovery, enhancing collaboration, analyzing data, and going beyond human intuition. An AI co-scientist performs various scientific tasks, especially in areas like hypothesis generation, experimental design, verification, and literature review. It uses the results to improve its ability to generate and refine hypotheses.


SimpleFold: Folding Proteins is Simpler than You Think

Wang, Yuyang, Lu, Jiarui, Jaitly, Navdeep, Susskind, Josh, Bautista, Miguel Angel

arXiv.org Artificial Intelligence

Protein folding models have achieved groundbreaking results, typically by integrating domain knowledge into both architectural blocks and training pipelines. Nonetheless, given the success of generative models across different but related problems, it is natural to question whether these architectural designs are a necessary condition for performant models. In this paper, we introduce SimpleFold, the first flow-matching-based protein folding model that uses only general-purpose transformer blocks. Protein folding models typically employ computationally expensive modules involving triangular updates, explicit pair representations, or multiple training objectives curated for this specific domain. Instead, SimpleFold employs standard transformer blocks with adaptive layers and is trained via a generative flow-matching objective with an additional structural term. We scale SimpleFold to 3B parameters and train it on approximately 9M distilled protein structures together with experimental PDB data. On standard folding benchmarks, SimpleFold-3B achieves competitive performance compared to state-of-the-art baselines; in addition, SimpleFold demonstrates strong performance in ensemble prediction, which is typically difficult for models trained via deterministic reconstruction objectives. Due to its general-purpose architecture, SimpleFold is efficient to deploy and run on consumer-level hardware. SimpleFold challenges the reliance on complex domain-specific architecture designs in protein folding, opening up an alternative design space for future progress.
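The flow-matching objective the abstract refers to can be sketched in a few lines. This toy version uses the common linear-interpolation path, where the target velocity between a noise sample and a data sample is simply their difference; the "model" is a stand-in callable, not SimpleFold's transformer, and the vectors are toy data rather than protein coordinates.

```python
import random

# Toy sketch of a flow-matching loss along the straight path
# x_t = (1 - t) * x0 + t * x1, whose target velocity field is x1 - x0.
# The model here is a placeholder callable, not SimpleFold's network.
def flow_matching_loss(model, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_pred = model(xt, t)
    v_target = [b - a for a, b in zip(x0, x1)]
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x0)

# An "oracle" that returns the exact target velocity incurs zero loss.
x0 = [random.gauss(0, 1) for _ in range(3)]   # noise sample
x1 = [1.0, 2.0, 3.0]                          # data sample
oracle = lambda xt, t: [b - a for a, b in zip(x0, x1)]
loss = flow_matching_loss(oracle, x0, x1, 0.5)
```

At sampling time, one would integrate the learned velocity field from noise to structure; the paper's additional structural term is omitted here.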


A Multimodal Human Protein Embeddings Database: DeepDrug Protein Embeddings Bank (DPEB)

Sajol, Md Saiful Islam, Rajasekaran, Magesh, Gemeinhardt, Hayden, Bess, Adam, Alvin, Chris, Mukhopadhyay, Supratik

arXiv.org Artificial Intelligence

Computationally predicting protein-protein interactions (PPIs) is challenging due to the lack of integrated, multimodal protein representations. DPEB is a curated collection of 22,043 human proteins that integrates four embedding types: structural (AlphaFold2), transformer-based sequence (BioEmbeddings), contextual amino acid patterns (ESM-2: Evolutionary Scale Modeling), and sequence-based n-gram statistics (ProtVec). AlphaFold2 protein structures are available through public databases (e.g., the AlphaFold Protein Structure Database), but the internal neural network embeddings are not. DPEB addresses this gap by providing AlphaFold2-derived embeddings for computational modeling. Our benchmark evaluations show GraphSAGE with BioEmbedding achieved the highest PPI prediction performance (87.37% AUROC, 79.16% accuracy). The framework also achieved 77.42% accuracy for enzyme classification and 86.04% accuracy for protein family classification. DPEB supports multiple graph neural network methods for PPI prediction, enabling applications in systems biology, drug target identification, pathway analysis, and disease mechanism studies.
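A minimal sketch of the multimodal idea, under invented names and toy dimensions: fuse a protein's per-modality embeddings by concatenation (in a fixed modality order) and score a candidate interaction with cosine similarity. DPEB's real pipeline feeds such fused vectors into graph neural networks like GraphSAGE rather than using raw similarity.

```python
import math

# Illustrative fusion of multimodal protein embeddings by concatenation,
# plus a naive cosine-similarity interaction score. Modality names and
# dimensions are made up; DPEB's embeddings are far higher-dimensional.
def fuse(modalities):
    """Concatenate a protein's embedding vectors in a fixed modality order."""
    fused = []
    for name in sorted(modalities):       # same order for every protein
        fused.extend(modalities[name])
    return fused

def ppi_score(a, b):
    """Cosine similarity between two fused embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

p1 = fuse({"structure": [0.1, 0.9], "sequence": [0.4, 0.2]})
p2 = fuse({"structure": [0.1, 0.9], "sequence": [0.4, 0.2]})
score = ppi_score(p1, p2)   # identical fused embeddings score 1.0
```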



ImmunoAI: Accelerated Antibody Discovery Using Gradient-Boosted Machine Learning with Thermodynamic-Hydrodynamic Descriptors and 3D Geometric Interface Topology

Shivakumar, Shawnak, Sandora, Matthew

arXiv.org Artificial Intelligence

Human metapneumovirus (hMPV) poses serious risks to pediatric, elderly, and immunocompromised populations. Traditional antibody discovery pipelines require 10-12 months, limiting their applicability for rapid outbreak response. This project introduces ImmunoAI, a machine learning framework that accelerates antibody discovery by predicting high-affinity candidates using gradient-boosted models trained on thermodynamic, hydrodynamic, and 3D topological interface descriptors. A dataset of 213 antibody-antigen complexes was curated to extract geometric and physicochemical features, and a LightGBM regressor was trained to predict binding affinity with high precision. The model reduced the antibody candidate search space by 89%, and fine-tuning on 117 SARS-CoV-2 binding pairs further reduced Root Mean Square Error (RMSE) from 1.70 to 0.92. In the absence of an experimental structure for the hMPV A2.2 variant, AlphaFold2 was used to predict its 3D structure. The fine-tuned model identified two optimal antibodies with predicted picomolar affinities targeting key mutation sites (G42V and E96K), making them excellent candidates for experimental testing. In summary, ImmunoAI shortens design cycles and enables faster, structure-informed responses to viral outbreaks.
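The gradient-boosting idea behind the paper's LightGBM regressor can be shown with a pure-Python toy: start from the mean prediction, then repeatedly fit a one-feature decision stump to the residuals and add it with a shrinkage factor. Everything here (data, single feature, hyperparameters) is invented for illustration; LightGBM adds trees, histogram binning, and many optimizations on top of this idea.

```python
# Toy gradient boosting with one-feature decision stumps, illustrating the
# additive-residual-fitting idea behind LightGBM. Data and hyperparameters
# are invented; the real model uses many thermodynamic/geometric features.
def fit_stump(xs, residuals):
    """Find the single threshold split minimising squared error on residuals."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x <= thr else rmean

def boost(xs, ys, rounds=20, lr=0.5):
    """Additive model: mean baseline plus shrunken stumps fit to residuals."""
    base = sum(ys) / len(ys)
    stumps, preds = [], [base] * len(ys)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + sum(lr * s(x) for s in stumps)

model = boost([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 3.0, 3.0])
```

The residuals shrink geometrically with each round, so even this tiny ensemble converges to the step function in the training data.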


dyAb: Flow Matching for Flexible Antibody Design with AlphaFold-driven Pre-binding Antigen

Tan, Cheng, Zhang, Yijie, Gao, Zhangyang, Huang, Yufei, Lin, Haitao, Wu, Lirong, Wu, Fandi, Blanchette, Mathieu, Li, Stan Z.

arXiv.org Artificial Intelligence

The development of therapeutic antibodies heavily relies on accurate predictions of how antigens will interact with antibodies. Existing computational methods in antibody design often overlook crucial conformational changes that antigens undergo during the binding process, significantly impacting the reliability of the resulting antibodies. To bridge this gap, we introduce dyAb, a flexible framework that incorporates AlphaFold2-driven predictions to model pre-binding antigen structures and specifically addresses the dynamic nature of antigen conformation changes. Our dyAb model leverages a unique combination of coarse-grained interface alignment and fine-grained flow matching techniques to simulate the interaction dynamics and structural evolution of the antigen-antibody complex, providing a realistic representation of the binding process. Extensive experiments show that dyAb significantly outperforms existing models in antibody design involving changing antigen conformations. These results highlight dyAb's potential to streamline the design process for therapeutic antibodies, promising more efficient development cycles and improved outcomes in clinical applications.


A Model-Centric Review of Deep Learning for Protein Design

Kyro, Gregory W., Qiu, Tianyin, Batista, Victor S.

arXiv.org Artificial Intelligence

Deep learning has transformed protein design, enabling accurate structure prediction, sequence optimization, and de novo protein generation. Advances in single-chain protein structure prediction via AlphaFold2, RoseTTAFold, ESMFold, and others have achieved near-experimental accuracy, inspiring successive work extended to biomolecular complexes via AlphaFold Multimer, RoseTTAFold All-Atom, AlphaFold 3, Chai-1, Boltz-1, and others. Generative models such as ProtGPT2, ProteinMPNN, and RFdiffusion have enabled sequence and backbone design beyond natural evolution-based limitations. More recently, joint sequence-structure co-design models, including ESM3, have integrated both modalities into a unified framework, resulting in improved designability. Despite these advances, challenges still exist pertaining to modeling sequence-structure-function relationships and ensuring robust generalization beyond the regions of protein space spanned by the training data. Future advances will likely focus on joint sequence-structure-function co-design frameworks that are able to model the fitness landscape more effectively than models that treat these modalities independently. Current capabilities, coupled with the dizzying rate of progress, suggest that the field will soon enable rapid, rational design of proteins with tailored structures and functions that transcend the limitations imposed by natural evolution. In this review, we discuss the current capabilities of deep learning methods for protein design, focusing on some of the most revolutionary and capable models with respect to their functionality and the applications that they enable, leading up to the current challenges of the field and the optimal path forward.


Nobel Prizes in physics and chemistry awarded for machine learning research

AIHub

The 2024 Nobel Prizes for physics and chemistry were announced on 8 and 9 October respectively. Both prizes were awarded for work enabling or using machine learning. More specifically, John Hopfield is recognised for "inventing a network that uses a method for saving and recreating patterns". The Hopfield network draws on the physics used to describe a material's characteristics in terms of its atomic spins. The network as a whole is described in a manner equivalent to the energy of the spin system found in physics, and is trained by finding values for the connections between the nodes so that the saved images have low energy.
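The energy picture described above can be made concrete with a toy Hopfield network: Hebbian learning sets the connection weights so that a stored pattern sits at a low-energy minimum, and asynchronous updates descend the energy to recover the pattern from a corrupted copy. The pattern here is an arbitrary six-unit example, not from the article.

```python
# Toy Hopfield network: store one pattern with the Hebbian rule and
# recover it from a corrupted copy by descending the network energy.
def train(patterns, n):
    """Hebbian weights w[i][j] = mean of s_i * s_j over patterns (no self-connections)."""
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / len(patterns)
    return w

def energy(w, s):
    """E = -1/2 * sum_ij w_ij * s_i * s_j; stored patterns have low energy."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def recall(w, s, sweeps=5):
    """Asynchronous updates: set each unit to the sign of its local field."""
    s = list(s)
    for _ in range(sweeps):
        for i in range(len(s)):
            field = sum(w[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if field >= 0 else -1
    return s

stored = [1, -1, 1, -1, 1, -1]
w = train([stored], len(stored))
noisy = [1, -1, -1, -1, 1, -1]        # one unit flipped
recovered = recall(w, noisy)
```

The corrupted state has strictly higher energy than the stored pattern, and the update rule only ever lowers the energy, which is why recall converges.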