A Supplementary materials
A.2 Documentation and intended uses
We include a datasheet [1] in Section B. Detailed documentation on the precise structure and content [...] OpenProteinSet is made available under the CC BY 4.0 license. The authors bear all responsibility in case of violation of rights. OpenProteinSet will continue to be hosted on RODA for the foreseeable future.
A.7 Alignment tool settings
For JackHMMer, we used -N 1 -E 0.0001 --incE 0.0001 --F1 0.0005 --F2 0.00005 --F3 0.0000005 and then capped outputs at depth 5000.
B.1 Motivation
For what purpose was the dataset created?
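The JackHMMer settings quoted in A.7 can be sketched as a small Python helper. This is an illustrative sketch, not the authors' pipeline: the function names, the file names, and the use of `-A` to write the alignment are assumptions, and the long options are written with the double-dash spelling that the HMMER command line uses. The depth cap is assumed here to mean keeping the first 5000 rows of the alignment, query first.

```python
from typing import List, Sequence

# Option values copied from section A.7; everything else is illustrative.
JACKHMMER_FLAGS = [
    "-N", "1",             # number of search iterations
    "-E", "0.0001",        # reporting E-value threshold
    "--incE", "0.0001",    # inclusion E-value threshold
    "--F1", "0.0005",      # acceleration filter thresholds
    "--F2", "0.00005",
    "--F3", "0.0000005",
]
MAX_DEPTH = 5000  # outputs were capped at this depth


def build_jackhmmer_cmd(query_fasta: str, database: str, out_sto: str) -> List[str]:
    """Assemble an argument list for subprocess.run (hypothetical helper)."""
    # -A saves the alignment of hits to a file in Stockholm format.
    return ["jackhmmer", *JACKHMMER_FLAGS, "-A", out_sto, query_fasta, database]


def cap_depth(rows: Sequence[str], max_depth: int = MAX_DEPTH) -> List[str]:
    """Keep at most max_depth alignment rows, assuming the query row comes first."""
    return list(rows[:max_depth])


if __name__ == "__main__":
    # Placeholder paths; substitute a real query and sequence database.
    cmd = build_jackhmmer_cmd("query.fasta", "sequence_db.fasta", "query.sto")
    print(" ".join(cmd))
```

Passing the thresholds as strings keeps them exactly as written in A.7 and avoids float formatting such as `5e-07` on the command line.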
OpenProteinSet: Training data for structural biology at scale
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
Ahdritz, Gustaf, Bouatta, Nazim, Kadyan, Sachin, Jarosch, Lukas, Berenberg, Daniel, Fisk, Ian, Watkins, Andrew M., Ra, Stephen, Bonneau, Richard, AlQuraishi, Mohammed