AITopics

Neural Information Processing Systems Apr-24-2026, 20:49:30 GMT

OpenProteinSet: Training data for structural biology at scale

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.

artificial intelligence, bioinformatics, machine learning, (17 more...)

Country: North America > United States (0.28)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Neural Information Processing Systems Feb-7-2026, 21:03:58 GMT

A Supplementary materials

A.2 Documentation and intended uses We include a datasheet [1] in Section B. Detailed documentation on the precise structure and content OpenProteinSet is made available under the CC BY 4.0 license. The authors bear all responsibility in case of violation of rights. OpenProteinSet will continue to be hosted on RODA for the foreseeable future.

A.7 Alignment tool settings For JackHMMer, we used -N 1 -E 0.0001 --incE 0.0001 --F1 0.0005 --F2 0.00005 --F3 0.0000005 and then capped outputs at depth 5000.

B.1 Motivation For what purpose was the dataset created?
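As a rough illustration of the alignment settings quoted in A.7, the sketch below assembles a jackhmmer argument list and caps an alignment at 5000 rows. The long options are written in jackhmmer's double-dash form (--incE, --F1, --F2, --F3). The file names, the helper names, the use of -A to save the alignment, and the "keep the first 5000 rows" capping rule are assumptions made for this example, not details taken from the paper.

```python
from typing import List

# Settings quoted in Section A.7 (long options in jackhmmer's "--" spelling).
JACKHMMER_SETTINGS = [
    "-N", "1",            # number of search iterations
    "-E", "0.0001",       # reporting E-value threshold
    "--incE", "0.0001",   # inclusion E-value threshold
    "--F1", "0.0005",     # acceleration filter thresholds
    "--F2", "0.00005",
    "--F3", "0.0000005",
]
MAX_DEPTH = 5000  # depth cap quoted in A.7


def build_jackhmmer_command(query_fasta: str, database_fasta: str, out_sto: str) -> List[str]:
    """Assemble an argument list (hypothetical helper).

    -A asks jackhmmer to save the alignment of hits to a file;
    the query file and the sequence database come last.
    """
    return ["jackhmmer", *JACKHMMER_SETTINGS, "-A", out_sto, query_fasta, database_fasta]


def cap_depth(rows: List[str], max_depth: int = MAX_DEPTH) -> List[str]:
    """Keep only the first max_depth alignment rows (assumed capping rule)."""
    return rows[:max_depth]
```

The resulting list can be passed to subprocess.run on a machine where HMMER is installed; building it as a list rather than a single string avoids shell quoting issues.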

artificial intelligence, dataset, openproteinset, (14 more...)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.72)
Law (0.47)

Technology: Information Technology > Artificial Intelligence (0.70)

Neural Information Processing Systems Feb-7-2026, 21:03:55 GMT

OpenProteinSet: Training data for structural biology at scale

Each row of an MSA is a protein sequence.
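To make the row-and-column picture concrete, here is a minimal sketch that reads a toy alignment in aligned-FASTA form and exposes its rows and columns. The parser, the helper names, and the three toy sequences are invented for illustration and are not drawn from OpenProteinSet.

```python
from typing import List, Tuple


def parse_aligned_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (name, aligned_sequence) pairs, one per MSA row."""
    rows: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                rows.append((name, "".join(chunks)))
            name = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        rows.append((name, "".join(chunks)))
    return rows


def column(rows: List[Tuple[str, str]], index: int) -> str:
    """Residues (or gaps) that all rows place at one alignment position."""
    return "".join(seq[index] for _, seq in rows)


# Toy alignment: "-" marks a gap; every row has the same length.
toy_msa = """>query
MKTAYI
>hit_1
MKS-YI
>hit_2
M-TAYL
"""
rows = parse_aligned_fasta(toy_msa)
```

Here rows[0] is the query sequence, and column(rows, 2) collects what each sequence has at the third alignment position.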

bioinformatics, machine learning, natural language, (18 more...)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Biomedical Informatics > Translational Bioinformatics (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

AIHub Apr-14-2025, 08:45:00 GMT

Repurposing protein folding models for generation with latent diffusion

PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models. What comes next after protein folding? In PLAID, we develop a method that learns to sample from the latent space of protein folding models to generate new proteins. It can accept compositional function and organism prompts, and can be trained on sequence databases, which are 2-4 orders of magnitude larger than structure databases. Unlike many previous protein structure generative models, PLAID addresses the multimodal co-generation problem setting: simultaneously generating both discrete sequence and continuous all-atom structural coordinates.

generative model, latent space, protein, (15 more...)

Country: North America > United States > New York (0.05)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.72)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial Intelligence Aug-10-2023

OpenProteinSet: Training data for structural biology at scale

Ahdritz, Gustaf, Bouatta, Nazim, Kadyan, Sachin, Jarosch, Lukas, Berenberg, Daniel, Fisk, Ian, Watkins, Andrew M., Ra, Stephen, Bonneau, Richard, AlQuraishi, Mohammed

artificial intelligence, bioinformatics, machine learning, (18 more...)

2308.05326

Country: North America > United States > New York (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Biomedical Informatics (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

arXiv.org Artificial Intelligence Dec-20-2022

MDL-based Compressing Sequential Rules

Chen, Xinhong, Gan, Wensheng, Wan, Shicheng, Gu, Tianlong

With the rapid development of the Internet, the era of big data has arrived, and the Internet generates huge amounts of data every day. However, extracting meaningful information from massive data is like looking for a needle in a haystack. Data mining techniques offer various feasible methods for this problem. Many sequential rule mining (SRM) algorithms have been proposed to find sequential rules in databases with sequential characteristics. These rules help people extract meaningful information from massive amounts of data. How can we compress the mined results and reduce data size to save storage space and transmission time? Until now, there has been little research on the compression of SRM. In this paper, using the Minimum Description Length (MDL) principle and two metrics (support and confidence), we introduce the problem of compressing SRM results and propose a solution named ComSR for MDL-based compression of sequential rules, built on a purpose-designed sequential rule coding scheme. To our knowledge, we are the first to use sequential rules to encode an entire database. A heuristic method is proposed to find a set of sequential rules that is as compact and meaningful as possible. ComSR has two trade-off algorithms, ComSR_non and ComSR_ful, which differ in whether the database can be completely compressed. Experiments on a real dataset with different thresholds show that a set of compact and meaningful sequential rules can be found, indicating that the proposed method works.
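The two metrics named in the abstract, support and confidence, can be illustrated with a small sketch. It assumes one common definition from the sequential rule mining literature: a rule X → Y holds in a sequence if every item of X appears before every item of Y, support is the fraction of sequences in which the rule holds, and confidence divides that count by the number of sequences containing X. The ComSR paper may define these differently; the function names and the toy database are invented for this example.

```python
from typing import List, Set

Sequence = List[Set[str]]  # a sequence is an ordered list of itemsets


def contains_rule(sequence: Sequence, antecedent: Set[str], consequent: Set[str]) -> bool:
    """True if all antecedent items occur strictly before all consequent items."""
    seen: Set[str] = set()
    for k in range(len(sequence) - 1):
        seen |= sequence[k]
        if antecedent <= seen:
            # The earliest split point leaves the largest possible suffix.
            rest = set().union(*sequence[k + 1:])
            return consequent <= rest
    return False


def contains_items(sequence: Sequence, items: Set[str]) -> bool:
    """True if every item occurs somewhere in the sequence."""
    return items <= set().union(*sequence)


def support(db: List[Sequence], antecedent: Set[str], consequent: Set[str]) -> float:
    """Fraction of sequences in which the rule holds."""
    hits = sum(contains_rule(s, antecedent, consequent) for s in db)
    return hits / len(db)


def confidence(db: List[Sequence], antecedent: Set[str], consequent: Set[str]) -> float:
    """Rule count divided by the number of sequences containing the antecedent."""
    hits = sum(contains_rule(s, antecedent, consequent) for s in db)
    base = sum(contains_items(s, antecedent) for s in db)
    return hits / base if base else 0.0


# Toy sequence database (illustrative only).
toy_db: List[Sequence] = [
    [{"a"}, {"b"}, {"c"}],
    [{"a"}, {"c"}, {"b"}],
    [{"b"}, {"a"}],
    [{"a", "b"}, {"c"}],
]
```

In this toy database the rule {a} → {b} holds in the first two sequences only: in the third, b precedes a, and in the fourth, a and b occur in the same itemset.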

artificial intelligence, data mining, machine learning, (16 more...)

2212.10252

Country: Asia > China > Guangdong Province > Guangzhou (0.04)

Genre: Research Report (0.64)

Industry:

Information Technology (0.68)
Materials > Metals & Mining (0.66)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory > Minimum Complexity Machines (0.70)

Chen, Lili, Gan, Wensheng, Chen, Chien-Ming

Towards Correlated Sequential Rules

arXiv.org Artificial Intelligence Oct-27-2022

The goal of high-utility sequential pattern mining (HUSPM) is to efficiently discover profitable or useful sequential patterns in a large number of sequences. However, simply being aware of utility-eligible patterns is insufficient for making predictions. To compensate for this deficiency, high-utility sequential rule mining (HUSRM) is designed to explore the confidence or probability of predicting the occurrence of consequence sequential patterns based on the appearance of premise sequential patterns. It has numerous applications, such as product recommendation and weather prediction. However, the existing algorithm, known as HUSRM, is limited to extracting all eligible rules while neglecting the correlation between the generated sequential rules. To address this issue, we propose a novel algorithm called correlated high-utility sequential rule miner (CoUSR) to integrate the concept of correlation into HUSRM. The proposed algorithm requires not only that each rule be correlated but also that the patterns in the antecedent and consequent of the high-utility sequential rule be correlated. The algorithm adopts a utility-list structure to avoid multiple database scans. Additionally, several pruning strategies are used to improve the algorithm's efficiency and performance. Based on several real-world datasets, subsequent experiments demonstrated that CoUSR is effective and efficient in terms of operation time and memory consumption.

data mining, machine learning, pattern recognition, (20 more...)

2210.15637

Country:

Asia > China > Guangdong Province > Guangzhou (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Shandong Province > Qingdao (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.67)

Wang, Xin, Kadioglu, Serdar

Dichotomic Pattern Mining with Applications to Intent Prediction from Semi-Structured Clickstream Datasets

arXiv.org Artificial Intelligence Jan-23-2022

We introduce a pattern mining framework that operates on semi-structured datasets and exploits the dichotomy between outcomes. Our approach takes advantage of constraint reasoning to find sequential patterns that occur frequently and exhibit desired properties. This allows the creation of novel pattern embeddings that are useful for knowledge extraction and predictive modeling. Finally, we present an application on customer intent prediction from digital clickstream data. Overall, we show that pattern embeddings play an integrator role between semi-structured data and machine learning models, improve the performance of the downstream task and retain interpretability.

intent prediction, prediction, sequence, (13 more...)

2201.09178

Country: North America > United States (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.80)

Lonlac, Jerry, Doniec, Arnaud, Lujak, Marin, Lecoeuche, Stephane

Extracting Seasonal Gradual Patterns from Temporal Sequence Data Using Periodic Patterns Mining

arXiv.org Artificial Intelligence Oct-20-2020

Mining frequent episodes aims at recovering sequential patterns from temporal data sequences, which can then be used to predict the occurrence of related events in advance. On the other hand, gradual patterns that capture co-variation of complex attributes in the form of "when X increases/decreases, Y increases/decreases" play an important role in many real-world applications where huge volumes of complex numerical data must be handled. Recently, these patterns have received attention from data mining researchers working on temporal data, who have proposed methods to automatically extract gradual patterns from such data. However, to the best of our knowledge, no method has been proposed to extract gradual patterns that regularly appear at identical time intervals in many sequences of temporal data, despite the fact that such patterns may add knowledge to certain applications, such as e-commerce. In this paper, we propose to extract, from sequences of temporal data, co-variations of periodically repeating attributes, which we call seasonal gradual patterns. For this purpose, we formulate the task of mining seasonal gradual patterns as the problem of mining periodic patterns in multiple sequences and then we exploit periodic pattern mining algorithms to extract seasonal gradual patterns. We discuss specific features of these patterns and propose an approach for their extraction based on mining periodic frequent patterns common to multiple sequences. We also propose a new anti-monotonous support definition associated with these seasonal gradual patterns. The illustrative results obtained from some real-world data sets show that the proposed approach is efficient and that it can extract small sets of patterns by filtering numerous nonseasonal patterns to identify the seasonal ones.

data mining, machine learning, pattern recognition, (18 more...)

2010.10289

Country:

Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
South America > Brazil (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry:

Banking & Finance (1.00)
Health & Medicine (0.68)
Information Technology > Services > e-Commerce Services (0.34)
Materials > Metals & Mining (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)