We outline recent developments in artificial intelligence (AI) and machine learning (ML) techniques for integrative structural biology of intrinsically disordered protein (IDP) ensembles. IDPs challenge the traditional protein structure-function paradigm by adapting their conformations in response to specific binding partners, which leads them to mediate diverse and often complex cellular functions such as biological signaling, self-organization, and compartmentalization. Obtaining mechanistic insights into their function can therefore be challenging for traditional structure determination techniques. Often, scientists have to rely on piecemeal evidence drawn from diverse experimental techniques to characterize their functional mechanisms. Multiscale simulations can help bridge critical knowledge gaps about IDP structure-function relationships; however, these techniques also face challenges in resolving emergent phenomena within IDP conformational ensembles. We posit that scalable statistical inference techniques can effectively integrate information gleaned from multiple experimental techniques as well as from simulations, thus providing access to atomistic details of these emergent phenomena.
We introduce a new representation and feature extraction method for biological sequences. We call the general representation bio-vectors (BioVec), with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences; it can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors, which can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93% ± 0.06% is obtained, outperforming existing family classification methods. In addition, we use the ProtVec representation to distinguish disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database and a database featuring the disordered regions of nucleoporins rich in phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that, by providing only sequence data for various proteins to this model, accurate information about protein structure can be determined.
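The idea of representing a protein sequence with a single dense vector can be illustrated with a minimal Python sketch. It assumes that the sequence is broken into overlapping 3-residue words and that the word vectors are summed; the abstract itself does not state these details, so the word length, the summation step, the function names, and the toy lookup table are all illustrative assumptions. In practice the word vectors would be learned by a neural network from a large sequence corpus rather than written by hand.

```python
def overlapping_kmers(sequence, k=3):
    """Return every overlapping k-residue word in the sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def embed_sequence(sequence, kmer_vectors, dim, k=3):
    """Sum the vectors of all k-mers in the sequence into one dense vector.

    kmer_vectors is a hypothetical lookup table (k-mer -> list of floats).
    k-mers missing from the table are skipped.
    """
    total = [0.0] * dim
    for kmer in overlapping_kmers(sequence, k):
        vector = kmer_vectors.get(kmer)
        if vector is None:
            continue  # assumption: unseen k-mers contribute nothing
        total = [a + b for a, b in zip(total, vector)]
    return total


# Toy 2-dimensional vectors, invented purely for illustration.
toy_vectors = {
    "MKV": [1.0, 0.0],
    "KVL": [0.0, 1.0],
    "VLA": [0.5, 0.5],
}

protein_vector = embed_sequence("MKVLA", toy_vectors, dim=2)
print(protein_vector)
```

A fixed-length vector of this kind can then be passed to a standard classifier, such as the support vector machines mentioned above, regardless of the length of the original sequence.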
Riback et al. (Reports, 13 October 2017, p. 238) used small-angle x-ray scattering (SAXS) experiments to infer a degree of compaction for unfolded proteins in water versus chemical denaturant that is highly consistent with the results from Förster resonance energy transfer (FRET) experiments. There is thus no "contradiction" between the two methods, nor evidence to support their claim that commonly used FRET fluorophores cause protein compaction.

Riback et al. (1) recently presented a "molecular form factor" (MFF) method addressing the well-known challenges (2) of analyzing small-angle x-ray scattering (SAXS) data for unfolded or intrinsically disordered proteins (IDPs) (3, 4). Combined with the precision of SAXS measurements coupled to size exclusion chromatography, their method yielded the following results: (i) Unfolded proteins in water have a polymer scaling exponent near the theta-solvent condition, where protein-protein and protein-solvent interactions are balanced; in denaturant, this increases to the limit where the protein-solvent interactions dominate. We are pleased that these findings are in overall agreement with SAXS and Förster resonance energy transfer (FRET) studies from our laboratories (3, 5, 6) and others (4).
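The polymer scaling exponent discussed above relates chain dimensions to chain length through the standard scaling law R_g = R_0 N^ν, where ν = 1/2 at the theta condition and ν ≈ 0.588 (often rounded to 3/5) in the good-solvent limit. The following Python sketch shows how a change in exponent translates into chain expansion. The chain length, the prefactor, and the function names are hypothetical values chosen for illustration, not numbers taken from the studies cited here, and the sketch assumes the same prefactor under both solvent conditions.

```python
def radius_of_gyration(n_residues, nu, prefactor_nm=0.2):
    """Scaling-law estimate R_g = R_0 * N**nu.

    prefactor_nm (R_0) is an illustrative placeholder, not a fitted value.
    """
    return prefactor_nm * n_residues ** nu


def expansion_ratio(n_residues, nu_from, nu_to):
    """Ratio of R_g at exponent nu_to to R_g at nu_from, same prefactor assumed."""
    return n_residues ** (nu_to - nu_from)


# Hypothetical 100-residue chain: theta-like (0.5) versus expanded (0.6).
n = 100
print(radius_of_gyration(n, 0.5))
print(radius_of_gyration(n, 0.6))
print(expansion_ratio(n, 0.5, 0.6))
```

Because the exponent enters as a power of chain length, even a modest difference in ν produces a measurable change in R_g for chains of typical protein size, which is why the two exponents are informative about solvent quality.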
As battles to contain the COVID-19 pandemic continue, attention is focused on emerging variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) that have been deemed variants of concern because they are resistant to antibodies elicited by infection or vaccination, or because they increase transmissibility or disease severity. Three papers used functional and structural studies to explore how mutations in the viral spike protein affect its ability to infect host cells and to evade host immunity. Gobeil et al. looked at a variant spike protein involved in transmission between minks and humans, as well as the B.1.1.7 (alpha), B.1.351 (beta), and P.1 (gamma) spike variants; Cai et al. focused on the alpha and beta variants; and McCallum et al. discussed the properties of the spike protein from the B.1.427/B.1.429 (epsilon) variant. Together, these papers show a balance among mutations that enhance stability, those that increase binding to the human receptor ACE2, and those that confer resistance to neutralizing antibodies. Science, abi6226, abi9745, abi7994, this issue.

Several fast-spreading variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) have become the dominant circulating strains in the COVID-19 pandemic. We report here cryo–electron microscopy structures of the full-length spike (S) trimers of the B.1.1.7 and B.1.351 variants, as well as their biochemical and antigenic properties. Amino acid substitutions in the B.1.1.7 protein increase both the accessibility of its receptor binding domain and the binding affinity for the receptor angiotensin-converting enzyme 2 (ACE2). The enhanced receptor engagement may account for the increased transmissibility. The B.1.351 variant has evolved to reshape antigenic surfaces of the major neutralizing sites on the S protein, making it resistant to some potent neutralizing antibodies.
These findings provide structural details on how SARS-CoV-2 has evolved to enhance viral fitness and immune evasion.
We strongly encourage you to read the rest of this document, scroll through the FAQ section, and train the test model from the example/ directory. Importantly, please read our Database of Structural Propensities of Proteins paper. Proteins are complex biomolecules built from 20 types of building blocks, the amino acids, which are connected sequentially into long non-branching chains commonly known as polypeptide chains. The unique spatial arrangement of polypeptide chains yields 3D molecular structures, which define protein function and interactions with other biomolecules. Although the basic forces that govern protein 3D structure formation are known and understood, the exact nature of polypeptide folding remains elusive and continues to be studied extensively.