protein language model
- North America > United States (0.04)
- Asia > Middle East > Jordan (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Italy > Friuli Venezia Giulia > Trieste Province > Trieste (0.05)
- Europe > Italy > Tuscany > Florence (0.04)
- Asia > Taiwan > Taiwan Province > Taipei (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > France (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.67)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > France (0.04)
- North America > United States > Virginia (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Ultrafast classical phylogenetic method beats large protein language models on variant effect prediction
Amino acid substitution rate matrices are fundamental to statistical phylogenetics and evolutionary biology. Estimating them typically requires reconstructed trees for massive amounts of aligned proteins, which poses a major computational bottleneck. In this paper, we develop a near-linear time method to estimate these rate matrices from multiple sequence alignments (MSAs) alone, thereby speeding up computation by orders of magnitude. Our method relies on a near-linear time cherry reconstruction algorithm which we call FastCherries and it can be easily applied to MSAs with millions of sequences. On both simulated and real data, we demonstrate the speed and accuracy of our method as applied to the classical model of protein evolution. By leveraging the unprecedented scalability of our method, we develop a new, rich phylogenetic model called SiteRM, which can estimate a general site-specific rate matrix for each column of an MSA. Remarkably, in variant effect prediction for both clinical and deep mutational scanning data in ProteinGym, we show that despite being an independent-sites model, our SiteRM model outperforms large protein language models that learn complex residue-residue interactions between different sites. We attribute our increased performance to conceptual advances in our probabilistic treatment of evolutionary data and our ability to handle extremely large MSAs. We anticipate that our work will have a lasting impact across both statistical phylogenetics and computational variant effect prediction.
PoET: A generative model of protein families as sequences-of-sequences
Generative protein language models are a natural way to design new proteins with desired functions. However, current models are either difficult to direct to produce a protein from a specific family of interest, or must be trained on a large multiple sequence alignment (MSA) from the specific family of interest, making them unable to benefit from transfer learning across families.