Deep Confident Steps to New Pockets: Strategies for Docking Generalization

Corso, Gabriele, Deng, Arthur, Fry, Benjamin, Polizzi, Nicholas, Barzilay, Regina, Jaakkola, Tommi

Feb-28-2024, arXiv.org Artificial Intelligence

Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks.

Understanding how small molecules and proteins interact, a task known as molecular docking, is at the heart of drug discovery. The conventional use of docking in industry has led the field to focus on finding binding conformations when the search is restricted to predefined pockets, and on evaluating methods on a relatively limited set of protein families of commercial interest. Blind docking across the proteome would, for example, help us understand the mechanism of action of new drugs to accelerate their development [Schottlender et al., 2022], predict adverse side-effects of drugs before clinical trials [Luo et al., 2018], and discover the function of the vast number of enzymes and membrane proteins whose biology we do not yet know [Yi et al., 2015].

All these tasks critically require docking methods to generalize beyond the relatively small class of well-studied proteins for which many structures are available. Existing docking benchmarks are largely built on collections of similar binding modes and fail to rigorously assess the ability of docking methods to generalize across the proteome. Gathering diverse data for protein-ligand interactions is challenging because binding pockets tend to be evolutionarily well-conserved due to their critical biological functions. Therefore, a large proportion of known interactions fall into a relatively small set of common binding modes.
The results show that increasing both data and model size can yield significant generalization improvements.

confidence model, ligand, bootstrapping, (17 more...)

Country:
- North America > United States
  - Massachusetts > Middlesex County
    - Cambridge (0.04)
  - California > Alameda County
    - Berkeley (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.93)
  - Representation & Reasoning > Search (0.68)
  - Machine Learning > Neural Networks
    - Deep Learning (0.93)
