binding specificity
A large language model for predicting T cell receptor-antigen binding specificity
Fang, Xing, Yu, Chenpeng, Tian, Shiye, Liu, Hui
The human immune response depends on the binding of T-cell receptors (TCRs) to antigens (pTCR), which triggers T cells to eliminate viruses, tumor cells, and other pathogens. The ability of the human immune system to respond to unknown viruses and bacteria stems from TCR diversity. However, this vast diversity poses challenges for TCR-antigen binding prediction methods. In this study, we propose a Masked Language Model (MLM), referred to as tcrLM, to overcome limitations in model generalization. Specifically, we randomly mask sequence segments and train tcrLM to infer the masked segments, thereby extracting expressive features from TCR sequences. Meanwhile, we introduce virtual adversarial training techniques to enhance the model's robustness. We built the largest TCR CDR3 sequence dataset to date (comprising 2,277,773,840 residues) and pre-trained tcrLM on this dataset. Our extensive experimental results demonstrate that tcrLM achieved AUC values of 0.937 and 0.933 on independent test sets and external validation sets, respectively, outperforming four previously published prediction methods. On a large-scale COVID-19 pTCR binding test set, our method outperforms the current state-of-the-art method by at least 8%, highlighting the generalizability of our method. Furthermore, we validated that our approach effectively predicts immunotherapy response and clinical outcomes on clinical cohort data. These findings indicate that tcrLM exhibits significant potential in predicting antigenic immunogenicity.
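The segment-masking step described in this abstract can be illustrated with a minimal sketch. It assumes a single contiguous masked span per sequence and uses a hypothetical mask symbol `X`; the abstract does not state tcrLM's actual mask token, span lengths, or masking rate, so the function name, parameters, and example sequence here are illustrative only.

```python
import random

MASK = "X"  # hypothetical mask symbol; the abstract does not specify tcrLM's mask token


def mask_segment(seq, span_len, rng):
    """Replace one random contiguous segment of `seq` with MASK symbols.

    Returns (masked sequence, start index, original segment). In masked
    language modeling, the model is trained to recover the original
    segment from the masked sequence.
    """
    if not 0 < span_len <= len(seq):
        raise ValueError("span_len must be between 1 and len(seq)")
    # Pick a start position so the whole span fits inside the sequence.
    start = rng.randrange(len(seq) - span_len + 1)
    target = seq[start:start + span_len]
    masked = seq[:start] + MASK * span_len + seq[start + span_len:]
    return masked, start, target


rng = random.Random(0)   # seeded for reproducibility
cdr3 = "CASSLGQAYEQYF"   # illustrative CDR3-like string, not from the paper's dataset
masked, start, target = mask_segment(cdr3, 3, rng)
print(masked, start, target)
```

The pair (`masked`, `target`) is one training example: the input with a hidden span and the residues the model should infer.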
A unified cross-attention model for predicting antigen binding specificity to both HLA and TCR molecules
Yu, Chenpeng, Fang, Xing, Liu, Hui
Immune checkpoint inhibitors have demonstrated promising clinical efficacy across various tumor types, yet the percentage of patients who benefit from them remains low. The binding affinity between antigens and HLA-I/TCR molecules plays a critical role in antigen presentation and T-cell activation. Some computational methods have been developed to predict antigen-HLA or antigen-TCR binding specificity, but each addresses only one of these tasks. In this paper, we propose UnifyImmun, a unified cross-attention transformer model designed to simultaneously predict the binding of antigens to both HLA and TCR molecules, thereby providing a more comprehensive evaluation of antigen immunogenicity. We devise a two-phase progressive training strategy that enables these two tasks to mutually reinforce each other by compelling the encoders to extract more expressive features. To further enhance model generalizability, we incorporate virtual adversarial training. Compared to over ten existing methods for predicting antigen-HLA and antigen-TCR binding, our method demonstrates better performance in both tasks. Notably, on a large-scale COVID-19 antigen-TCR binding test set, our method improves performance by at least 9% compared to the current state-of-the-art methods. Validation experiments on three clinical cohorts confirm that our approach effectively predicts immunotherapy response and clinical outcomes. Furthermore, the cross-attention scores reveal the amino acid sites critical for antigen binding to receptors. In essence, our approach marks a significant step towards comprehensive evaluation of antigen immunogenicity.
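To make the idea of cross-attention scores between antigen and receptor residues concrete, the following is a generic single-head scaled dot-product cross-attention sketch. It is not UnifyImmun's architecture: the learned projections, number of heads, and embedding sizes are not given in the abstract, and the random embeddings and the function name `cross_attention` are placeholders chosen for illustration.

```python
import numpy as np


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: (n_q, d), keys: (n_k, d), values: (n_k, d_v).
    Returns (output of shape (n_q, d_v), attention weights of shape (n_q, n_k)).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


rng = np.random.default_rng(0)
antigen = rng.normal(size=(9, 16))    # 9 antigen residues, random placeholder embeddings
receptor = rng.normal(size=(14, 16))  # 14 receptor residues, random placeholder embeddings
out, attn = cross_attention(antigen, receptor, receptor)
print(out.shape, attn.shape)
```

Each row of `attn` is a distribution over receptor positions for one antigen residue; in a trained model, weights of this kind are what one would inspect to find residue pairs that matter for binding.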
Sequence-Based Nanobody-Antigen Binding Prediction
Sardar, Usama, Ali, Sarwan, Ayub, Muhammad Sohaib, Shoaib, Muhammad, Bashir, Khurram, Khan, Imdad Ullah, Patterson, Murray
Nanobodies (Nb) are monomeric heavy-chain fragments derived from heavy-chain-only antibodies naturally found in camelids and sharks. Their small size (3-4 nm; 13 kDa) and favorable biophysical properties make them attractive targets for recombinant production. Furthermore, their ability to bind selectively to specific antigens, such as toxins, chemicals, bacteria, and viruses, makes them powerful tools in cell biology, structural biology, and medical diagnostics, and promising future therapeutic agents for treating cancer and other serious illnesses. However, a critical challenge in nanobody production is that nanobodies are unavailable for a majority of antigens. Although some computational methods have been proposed to screen potential nanobodies for given target antigens, their practical application is highly restricted by their reliance on 3D structures. Moreover, predicting nanobody-antigen interactions (binding) is a time-consuming and labor-intensive task. This study aims to develop a machine-learning method to predict nanobody-antigen binding based solely on sequence data. We curated a comprehensive dataset of nanobody-antigen binding and nonbinding data and devised an embedding method based on gapped k-mers to predict binding using only the sequences of the nanobody and antigen. Our approach achieves up to 90% accuracy in binding prediction and is significantly more efficient than the widely used computational docking technique.
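As a rough illustration of a gapped k-mer representation, the sketch below counts one common variant: every window of fixed length contributes one feature for each choice of k retained positions, with the remaining positions replaced by a gap symbol. The abstract does not specify the exact gapped k-mer scheme or the parameters used, so the function name, the `window` and `k` arguments, and the `-` gap symbol are assumptions.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, window, k):
    """Count gapped k-mers in `seq`.

    Each length-`window` substring yields one feature per way of keeping
    k of its positions; the other positions are replaced by '-'.
    """
    if not 0 < k <= window:
        raise ValueError("k must be between 1 and window")
    counts = Counter()
    for i in range(len(seq) - window + 1):
        sub = seq[i:i + window]
        for keep in combinations(range(window), k):
            key = "".join(sub[j] if j in keep else "-" for j in range(window))
            counts[key] += 1
    return counts


# Toy sequence for illustration only.
counts = gapped_kmer_counts("ACDA", window=3, k=2)
print(sorted(counts.items()))
```

A fixed-length embedding can then be obtained by mapping each possible gapped k-mer to a vector index; the counts for a nanobody and an antigen sequence could be concatenated as input to a classifier.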