Learning Mutational Semantics

Neural Information Processing Systems

In many natural domains, changing a small part of an entity can transform its semantics; for example, a single word change can alter the meaning of a sentence, or a single amino acid change can mutate a viral protein to escape antiviral treatment or immunity. Although identifying such mutations can be desirable (for example, therapeutic design that anticipates avenues of viral escape), the rules governing semantic change are often hard to quantify. Here, we introduce the problem of identifying mutations with a large effect on semantics, but where valid mutations are under complex constraints (for example, English grammar or biological viability), which we refer to as constrained semantic change search (CSCS). We propose an unsupervised solution based on language models that simultaneously learn continuous latent representations. We report good empirical performance on CSCS of single-word mutations to news headlines, map a continuous semantic space of viral variation, and, notably, show unprecedented zero-shot prediction of single-residue escape mutations to key influenza and HIV proteins, suggesting a productive link between modeling natural language and pathogenic evolution.
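The core idea of constrained semantic change search — ranking candidate mutations so that those with both a large semantic change (distance in the model's latent space) and high grammaticality (model likelihood) come first — can be sketched as follows. The `embed` and `log_prob` functions below are hypothetical stand-ins for a trained language model's latent representation and token likelihood; the rank-based combination is an illustrative choice, not a definitive implementation.

```python
import numpy as np

def cscs_rank(seq, vocab, embed, log_prob, beta=1.0):
    """Rank single-token mutations of `seq` by constrained semantic change.

    embed(seq) -> np.ndarray latent vector for a sequence.
    log_prob(tok, pos, seq) -> float, model likelihood of `tok` at `pos`.
    Both are hypothetical interfaces for illustration only.
    """
    z = embed(seq)
    candidates = []
    for pos, orig in enumerate(seq):
        for tok in vocab:
            if tok == orig:
                continue
            mutant = seq[:pos] + [tok] + seq[pos + 1:]
            sem_change = np.abs(embed(mutant) - z).sum()  # semantic change: l1 distance in latent space
            gram = log_prob(tok, pos, seq)                # grammaticality: model likelihood of the mutation
            candidates.append((pos, tok, sem_change, gram))
    # Combine the two terms via their ranks so they share a common scale.
    sem_rank = {i: r for r, i in enumerate(np.argsort([c[2] for c in candidates]))}
    gram_rank = {i: r for r, i in enumerate(np.argsort([c[3] for c in candidates]))}
    acquisition = [sem_rank[i] + beta * gram_rank[i] for i in range(len(candidates))]
    order = np.argsort(acquisition)[::-1]  # highest acquisition score first
    return [candidates[i][:2] for i in order]  # list of (position, token) mutations
```

On a protein, `seq` would be a list of amino acids and `vocab` the 20 residues; on a headline, tokens and a word vocabulary. The hyperparameter `beta` trades off semantic novelty against validity under the model's constraints.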


Review for NeurIPS paper: Learning Mutational Semantics

Neural Information Processing Systems

Weaknesses: There are a few weaknesses that might be helpful to address (also see comment on Correctness): clarification of the notion of "grammaticality" for this problem; further connections to similar approaches for protein modeling that could be considered; the slightly ad hoc nature of the CSCS objective; and the fact that comparisons made to similarly high-capacity deep unsupervised models in the Appendix did not use viral data. These are explained in further detail below:

Appropriateness of "grammaticality" for the viral immunological escape problem: I appreciate the trend of using massive amounts of unsupervised data to circumvent the difficulty of obtaining fitness measurements for biological sequences, which this work also advances. However, there is some degree of implicit supervision involved here, in that the amino acid sequences used (described in Lines 199-200) are explicitly from infectious viruses (rather than somehow being neutral/benign). It is also not clear that casting this observation as merely an issue of "grammaticality" makes sense: if a "grammatically correct" sequence is one that belongs to an infectious virus, what is the difference between grammaticality and semantics (which are also supposed to capture what makes a sequence infectious)? Perhaps one could claim that grammaticality in this context has to do with whether the protein is "valid", in the sense that it folds or is stable, but this is not explained, and does not absolve the first point that all the sequence data come from infectious viruses (rather than, for example, all valid protein variants that fold/are stable, which would allow for a much clearer distinction between grammaticality and semantics).

