AITopics | openproteinset

A Supplementary materials

Neural Information Processing Systems, Apr-24-2026, 20:49:33 GMT

A.1 Access instructions
OpenProteinSet is hosted by the Registry of Open Data on AWS (RODA) and can be accessed at the following link: registry.opendata.aws/openfold/.

A.2 Documentation and intended uses
We include a datasheet [1] in Section B. Detailed documentation on the precise structure and content of the dataset is provided on the dataset's landing page.

A.3 Data format
All OpenProteinSet files are in standard plaintext formats (A3M for MSAs, HHSearch format for template hits, and PDB for structure files) that can be read by a wide variety of bioinformatics software.

A.5 License
OpenProteinSet is made available under the CC BY 4.0 license. A copy of the license is provided with the dataset.
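Because the MSAs are distributed as plaintext A3M files (A.3), they can be read without specialised tooling. The following is a minimal sketch, not the authors' loader: the function names `parse_a3m` and `drop_insertions` and the toy alignment `EXAMPLE_A3M` are hypothetical. It assumes the usual A3M convention that lowercase letters mark insertions relative to the query and `-` marks a gap, so removing lowercase letters yields rows of equal length.

```python
from typing import List, Optional, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)

# Toy alignment for illustration only; not taken from OpenProteinSet.
EXAMPLE_A3M = """>query
MKTAY
>hit1
MK-AaY
"""


def parse_a3m(text: str) -> List[Record]:
    """Parse FASTA-like A3M text into (header, sequence) pairs."""
    records: List[Record] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (assumed to start with "#").
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_insertions(records: List[Record]) -> List[Record]:
    """Remove lowercase (insertion) characters so every row has the query's length."""
    return [(h, "".join(c for c in s if not c.islower())) for h, s in records]


if __name__ == "__main__":
    for header, row in drop_insertions(parse_a3m(EXAMPLE_A3M)):
        print(header, row)
```

To use it on a real file, read the file's contents into a string and pass that to `parse_a3m`; the exact file layout is documented on the dataset's landing page.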

artificial intelligence, bioinformatics, dataset

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.72)
Law (0.47)

Technology:

Information Technology > Artificial Intelligence (0.70)
Information Technology > Biomedical Informatics (0.50)

OpenProteinSet: Training data for structural biology at scale

Neural Information Processing Systems, Apr-24-2026, 20:49:30 GMT

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.

artificial intelligence, bioinformatics, machine learning

Country: North America > United States (0.28)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

A Supplementary materials

Neural Information Processing Systems, Feb-7-2026, 21:03:58 GMT

A.2 Documentation and intended uses
We include a datasheet [1] in Section B. Detailed documentation on the precise structure and content of the dataset is provided on the dataset's landing page. [...]

A.5 License
OpenProteinSet is made available under the CC BY 4.0 license. The authors bear all responsibility in case of violation of rights. OpenProteinSet will continue to be hosted on RODA for the foreseeable future.

A.7 Alignment tool settings
For JackHMMer, we used -N 1 -E 0.0001 --incE 0.0001 --F1 0.0005 --F2 0.00005 --F3 0.0000005 and then capped outputs at depth 5000.

B.1 Motivation
For what purpose was the dataset created?
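The JackHMMer settings quoted in A.7 can be assembled into a command line as sketched below. This is an illustrative reconstruction under stated assumptions, not the authors' pipeline: the helper names, the file paths, and the use of `-A` to save the alignment are hypothetical choices, and the depth cap is shown as simple truncation to the first 5000 rows because the snippet does not say how the cap was applied. HMMER spells its long options with two dashes (`--incE`, `--F1`, `--F2`, `--F3`), which is how they appear here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Settings as quoted in A.7, using HMMER's double-dash long options.
JACKHMMER_SETTINGS: List[str] = [
    "-N", "1",            # one search iteration
    "-E", "0.0001",       # reporting E-value threshold
    "--incE", "0.0001",   # inclusion E-value threshold
    "--F1", "0.0005",     # acceleration filter thresholds
    "--F2", "0.00005",
    "--F3", "0.0000005",
]
MAX_DEPTH = 5000  # depth cap stated in A.7


def build_jackhmmer_command(
    query_fasta: str,
    database_fasta: str,
    alignment_out: str,
    binary: str = "jackhmmer",
) -> List[str]:
    """Return an argv list: jackhmmer [options] <query> <database>."""
    # "-A" (save the hit alignment to a file) is an assumption for this sketch.
    return [binary, *JACKHMMER_SETTINGS, "-A", alignment_out,
            query_fasta, database_fasta]


def cap_depth(rows: Sequence[T], max_depth: int = MAX_DEPTH) -> List[T]:
    """Keep at most max_depth alignment rows (assumed: plain truncation)."""
    return list(rows[:max_depth])


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(" ".join(build_jackhmmer_command("query.fasta", "db.fasta", "out.sto")))
```

If HMMER is installed, the returned list can be passed to `subprocess.run(cmd, check=True)`; building the command as a list avoids shell quoting issues.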

artificial intelligence, dataset, openproteinset

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.72)
Law (0.47)

Technology: Information Technology > Artificial Intelligence (0.70)

OpenProteinSet: Training data for structural biology at scale

Neural Information Processing Systems, Feb-7-2026, 21:03:55 GMT

Each row of an MSA is a protein sequence.

bioinformatics, machine learning, natural language

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Biomedical Informatics > Translational Bioinformatics (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

OpenProteinSet: Training data for structural biology at scale

Neural Information Processing Systems, Dec-23-2025, 21:57:39 GMT

openproteinset, structural biology, training data

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.81)

OpenProteinSet: Training data for structural biology at scale

Neural Information Processing Systems, Oct-9-2024, 17:42:29 GMT

openproteinset, structural biology, training data

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.99)

OpenProteinSet: Training data for structural biology at scale

Gustaf Ahdritz, Nazim Bouatta, Sachin Kadyan, Lukas Jarosch, Daniel Berenberg, Ian Fisk, Andrew M. Watkins, Stephen Ra, Richard Bonneau, Mohammed AlQuraishi

arXiv.org Artificial Intelligence, Aug-10-2023

artificial intelligence, bioinformatics, machine learning

arXiv:2308.05326

Country: North America > United States > New York (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: