The language of a virus


Uncovering connections between seemingly unrelated branches of science might accelerate research in one branch by using the methods developed in the other branch as stepping stones. On page 284 of this issue, Hie et al. ([ 1 ][1]) provide an elegant example of such unexpected connections. The authors have uncovered a parallel between the properties of a virus and its interpretation by the host immune system and the properties of a sentence in natural language and its interpretation by a human. By leveraging an extensive natural language processing (NLP) toolbox ([ 2 ][2], [ 3 ][3]) developed over the years, they have come up with a powerful new method for the identification of mutations that allow a virus to escape from recognition by neutralizing antibodies. In 1950, Alan Turing predicted that machines will eventually compete with men in “intellectual fields” and suggested that one possible way forward would be to build a machine that can be taught to understand and speak English ([ 4 ][4]). This was, and still is, an ambitious goal. It is clear that language grammar can provide a formal skeleton for building sentences, but how can machines be trained to infer the meanings? In natural language, there are many ways to express the same idea, and yet small changes in expression can often change the meaning. Linguistics developed a way of quantifying the similarity of meaning (semantics). Specifically, it was proposed that words that are used in the same context are likely to have similar meanings ([ 5 ][5], [ 6 ][6]). This distributional hypothesis became a key feature for the computational technique in NLP, known as word (semantic) embedding. The main idea is to characterize words as vectors that represent distributional properties in a large amount of language data and then embed these sparse, high-dimensional vectors into more manageable, low-dimensional space in a distance-preserving manner. By the distributional hypothesis, this technique should group words that have similar semantics together in the embedding space. Hie et al. proposed that viruses can also be thought to have a grammar and semantics. Intuitively, the grammar describes which sequences make specific viruses (or their parts). Biologically, a viral protein sequence should have all the properties needed to invade a host, multiply, and continue invading another host. Thus, in some way, the grammar represents the fitness of a virus. With enough data, current machine learning approaches can be used to learn this sequence-based fitness function. ![Figure][7] Predicting immune escape The constrained semantic change search algorithm obtains semantic embeddings of all mutated protein sequences using bidirectional long short-term memory (LSTM). The sequences are ranked according to the combined score of the semantic change (the distance of a mutation from the original sequence) and fitness (the probability that a mutation appears in viral sequences). GRAPHIC: V. ALTOUNIAN/SCIENCE But what would be the meaning (semantics) of a virus? Hie et al. suggested that the semantics of a virus should be defined in terms of its recognition by immune systems. Specifically, viruses with different semantics would require a different state of the immune system (for example, different antibodies) to be recognized. The authors hypothesized that semantic embeddings allow sequences that require different immune responses to be uncovered. In this context, words represent protein sequences (or protein fragments), and recognition of such protein fragments is the task performed by the immune system. To escape immune responses, viral genomes can become mutated so that the virus evolves to no longer be recognized by the immune system. However, a virus that acquires a mutation that compromises its function (and thus fitness) will not survive. Using the NLP analogy, immune escape will be achieved by the mutations that change the semantics of the virus while maintaining its grammaticality so that the virus will remain infectious but escape the immune system. On the basis of this idea, Hie et al. developed a new approach, called constrained semantic change search (CSCS). Computationally, the goal of CSCS is to identify mutations that confer high fitness and substantial semantic changes at the same time (see the figure). The immune escape scores are computed by combining the two quantities. The search algorithm builds on a powerful deep learning technique for language modeling, called long short-term memory (LSTM), to obtain semantic embeddings of all mutated sequences and rank the sequences according to their immune escape scores in the embedded space. The semantic changes correspond to the distance of the mutated sequences to the original sequence in the semantic embedding, and its “grammaticality” (or fitness) is estimated by the probability that the mutation appears in viral sequences. The immune escape scores can then be computed by simultaneously considering both the semantic distance and fitness probability. Hie et al. confirmed their hypothesis for the correspondence of grammaticality and semantics to fitness and immune response in three viral proteins: influenza A hemagglutinin (HA), HIV-1 envelope (Env), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Spike. For the analogy of semantics to immune response, they found that clusters of semantically similar viruses were in good correspondence with virus subtypes, host, or both, confirming that the language model can extract functional meanings from protein sequences. The clustering patterns also revealed interspecies transmissibility and antigenic similarity. The correspondence of grammaticality to fitness was assessed more directly by using deep mutational scans evaluated for replication fitness (for HA and Env) or binding (for Spike). The combined model was tested against experimentally verified mutations that allow for immue escape. Scoring each amino acid residue with CSCS, the authors uncovered viral protein regions that are significantly enriched with escape potential: the head of HA for influenza, the V1/V2 hypervariable regions for HIV Env, and the receptor-binding domain (RBD) and amino-terminal domain for SARS-CoV-2 Spike. The language of viral evolution and escape proposed by Hie et al. provides a powerful framework for predicting mutations that lead to viral escape. However, interesting questions remain. Further extending the natural language analogy, it is notable that individuals can interpret the same English sentence differently depending on their past experience and the fluency in the language. Similarly, immune response differs between individuals depending on factors such as past pathogenic exposures and overall “strength” of the immune system. It will be interesting to see whether the proposed approach can be adapted to provide a “personalized” view of the language of virus evolution. 1. [↵][8]1. B. Hie, 2. E. Zhong, 3. B. Berger, 4. B. Bryson , Science 371, 284 (2021). [OpenUrl][9][Abstract/FREE Full Text][10] 2. [↵][11]1. L. Yann, 2. Y. Bengio, 3. G. Hinton , Nature 521, 436 (2015). [OpenUrl][12][CrossRef][13][PubMed][14] 3. [↵][15]1. T. Young, 2. D. Hazarika, 3. S. Poria, 4. E. Cambria , IEEE Comput. Intell. Mag. 13, 55 (2018). [OpenUrl][16] 4. [↵][17]1. A. Turing , Mind LIX, 433 (1950). 5. [↵][18]1. Z. S. Harris , Word 10, 146 (1954). [OpenUrl][19][CrossRef][20][PubMed][21] 6. [↵][22]1. J. R. Firth , in Studies in Linguistic Analysis (1957), pp. 1–32. Acknowledgments: The authors are supported by the Intramural Research Programs of the National Library of Medicine at the National Institutes of Health, USA. [1]: #ref-1 [2]: #ref-2 [3]: #ref-3 [4]: #ref-4 [5]: #ref-5 [6]: #ref-6 [7]: pending:yes [8]: #xref-ref-1-1 "View reference 1 in text" [9]: {openurl}?query=rft.jtitle%253DScience%26rft.stitle%253DScience%26rft.aulast%253DHie%26rft.auinit1%253DB.%26rft.volume%253D371%26rft.issue%253D6526%26rft.spage%253D284%26rft.epage%253D288%26rft.atitle%253DLearning%2Bthe%2Blanguage%2Bof%2Bviral%2Bevolution%2Band%2Bescape%26rft_id%253Dinfo%253Adoi%252F10.1126%252Fscience.abd7331%26rft.genre%253Darticle%26rft_val_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Ajournal%26ctx_ver%253DZ39.88-2004%26url_ver%253DZ39.88-2004%26url_ctx_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Actx [10]: /lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEyOiIzNzEvNjUyNi8yODQiO3M6NDoiYXRvbSI7czoyMjoiL3NjaS8zNzEvNjUyNi8yMzMuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9 [11]: #xref-ref-2-1 "View reference 2 in text" [12]: {openurl}?query=rft.jtitle%253DNature%26rft.volume%253D521%26rft.spage%253D436%26rft_id%253Dinfo%253Adoi%252F10.1038%252Fnature14539%26rft_id%253Dinfo%253Apmid%252F26017442%26rft.genre%253Darticle%26rft_val_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Ajournal%26ctx_ver%253DZ39.88-2004%26url_ver%253DZ39.88-2004%26url_ctx_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Actx [13]: /lookup/external-ref?access_num=10.1038/nature14539&link_type=DOI [14]: /lookup/external-ref?access_num=26017442&link_type=MED&atom=%2Fsci%2F371%2F6526%2F233.atom [15]: #xref-ref-3-1 "View reference 3 in text" [16]: {openurl}?query=rft.jtitle%253DIEEE%2BComput.%2BIntell.%2BMag.%26rft.volume%253D13%26rft.spage%253D55%26rft.genre%253Darticle%26rft_val_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Ajournal%26ctx_ver%253DZ39.88-2004%26url_ver%253DZ39.88-2004%26url_ctx_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Actx [17]: #xref-ref-4-1 "View reference 4 in text" [18]: #xref-ref-5-1 "View reference 5 in text" [19]: {openurl}?query=rft.jtitle%253DWord%26rft.volume%253D10%26rft.spage%253D146%26rft_id%253Dinfo%253Adoi%252F10.1080%252F00437956.1954.11659520%26rft_id%253Dinfo%253Apmid%252F32513867%26rft.genre%253Darticle%26rft_val_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Ajournal%26ctx_ver%253DZ39.88-2004%26url_ver%253DZ39.88-2004%26url_ctx_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Actx [20]: /lookup/external-ref?access_num=10.1080/00437956.1954.11659520&link_type=DOI [21]: /lookup/external-ref?access_num=32513867&link_type=MED&atom=%2Fsci%2F371%2F6526%2F233.atom [22]: #xref-ref-6-1 "View reference 6 in text"

Duplicate Docs Excel Report

None found

Similar Docs  Excel Report  more

None found